Loan_Modelling.csv: the raw data used for this project.

Data dictionary:

- ID: Customer ID
- Age: Customer's age in completed years
- Experience: Years of professional experience
- Income: Annual income of the customer (in thousand dollars)
- ZIP Code: Home address ZIP code
- Family: Family size of the customer
- CCAvg: Average spending on credit cards per month (in thousand dollars)
- Education: Education level. 1: Undergrad; 2: Graduate; 3: Advanced/Professional
- Mortgage: Value of house mortgage, if any (in thousand dollars)
- Personal_Loan: Did this customer accept the personal loan offered in the last campaign?
- Securities_Account: Does the customer have a securities account with the bank?
- CD_Account: Does the customer have a certificate of deposit (CD) account with the bank?
- Online: Does the customer use internet banking facilities?
- CreditCard: Does the customer use a credit card issued by any other bank (excluding All life Bank)?

# this will help in making the Python code more structured automatically (good coding practice)
%load_ext nb_black
# Library to suppress warnings or deprecation notes
import warnings
warnings.filterwarnings("ignore")
# Libraries to help with reading and manipulating data
import pandas as pd
import numpy as np
# Library to split data
from sklearn.model_selection import train_test_split
# Libraries to help with data visualization
import matplotlib.pyplot as plt
import seaborn as sns
# Resize the figures:
plt.rc("figure", figsize=[10, 6])
# Removes the limit for the number of displayed columns
pd.set_option("display.max_columns", None)
# Sets the limit for the number of displayed rows
pd.set_option("display.max_rows", 200)
# To build model for prediction
from sklearn.linear_model import LogisticRegression
# Libraries to build decision tree classifier
from sklearn.tree import DecisionTreeClassifier
from sklearn import tree
# To tune different models
from sklearn.model_selection import GridSearchCV
# To get different metric scores
from sklearn.metrics import (
    f1_score,
    accuracy_score,
    recall_score,
    precision_score,
    confusion_matrix,
    roc_auc_score,
    plot_confusion_matrix,
    precision_recall_curve,
    roc_curve,
    make_scorer,
)
# Get the uszipcode package
#!pip install uszipcode
from uszipcode import SearchEngine
from sklearn import metrics
loan = pd.read_csv("Loan_Modelling.csv")
# copy the data to another variable to avoid overwriting the original data
data = loan.copy()
data.head()
|   | ID | Age | Experience | Income | ZIPCode | Family | CCAvg | Education | Mortgage | Personal_Loan | Securities_Account | CD_Account | Online | CreditCard |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 25 | 1 | 49 | 91107 | 4 | 1.6 | 1 | 0 | 0 | 1 | 0 | 0 | 0 |
| 1 | 2 | 45 | 19 | 34 | 90089 | 3 | 1.5 | 1 | 0 | 0 | 1 | 0 | 0 | 0 |
| 2 | 3 | 39 | 15 | 11 | 94720 | 1 | 1.0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 |
| 3 | 4 | 35 | 9 | 100 | 94112 | 1 | 2.7 | 2 | 0 | 0 | 0 | 0 | 0 | 0 |
| 4 | 5 | 35 | 8 | 45 | 91330 | 4 | 1.0 | 2 | 0 | 0 | 0 | 0 | 0 | 1 |
data.drop(["ID"], axis=1, inplace=True)
search = SearchEngine(simple_zipcode=False)
# create a function to find County from zipcode:
def find_county(zipcode):
    county = search.by_zipcode(zipcode).county
    # Fall back to the raw ZIP code when uszipcode has no county for it
    if county is None:
        county = zipcode
    return county
data["County"] = data["ZIPCode"].apply(find_county)
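Because many customers share the same ZIP code, it can be cheaper to resolve each unique ZIP code once and then map the results back, instead of calling the lookup row by row with `apply`. A minimal sketch of that pattern, using a small hypothetical dictionary in place of the uszipcode lookup:

```python
import pandas as pd

# Hypothetical ZIP-to-county lookup standing in for uszipcode results:
county_lookup = {91107: "Los Angeles County", 94720: "Alameda County"}

zips = pd.Series([91107, 94720, 91107, 99999])

# Map each ZIP through the dict; unknown ZIPs fall back to the ZIP itself,
# mirroring the fallback in find_county above.
counties = zips.map(county_lookup).fillna(zips)
print(counties.tolist())
```

With the real data, the dictionary would be built by running `find_county` once per unique value of `data["ZIPCode"]`.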
data["County"].value_counts()
Los Angeles County        1095
San Diego County           568
Santa Clara County         563
Alameda County             500
Orange County              339
San Francisco County       257
San Mateo County           204
Sacramento County          184
Santa Barbara County       154
Yolo County                130
Monterey County            128
Ventura County             114
San Bernardino County      101
Contra Costa County         85
Santa Cruz County           68
Riverside County            56
Marin County                54
Kern County                 54
Solano County               33
San Luis Obispo County      33
Humboldt County             32
Sonoma County               28
Fresno County               26
Placer County               24
92717                       22
Butte County                19
Shasta County               18
El Dorado County            17
Stanislaus County           15
San Benito County           14
San Joaquin County          13
Mendocino County             8
Tuolumne County              7
Siskiyou County              7
96651                        6
92634                        5
Lake County                  4
Merced County                4
Trinity County               4
Napa County                  3
Imperial County              3
93077                        1
Name: County, dtype: int64
data.drop(["ZIPCode"], axis=1, inplace=True)
data.head()
|   | Age | Experience | Income | Family | CCAvg | Education | Mortgage | Personal_Loan | Securities_Account | CD_Account | Online | CreditCard | County |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 25 | 1 | 49 | 4 | 1.6 | 1 | 0 | 0 | 1 | 0 | 0 | 0 | Los Angeles County |
| 1 | 45 | 19 | 34 | 3 | 1.5 | 1 | 0 | 0 | 1 | 0 | 0 | 0 | Los Angeles County |
| 2 | 39 | 15 | 11 | 1 | 1.0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | Alameda County |
| 3 | 35 | 9 | 100 | 1 | 2.7 | 2 | 0 | 0 | 0 | 0 | 0 | 0 | San Francisco County |
| 4 | 35 | 8 | 45 | 4 | 1.0 | 2 | 0 | 0 | 0 | 0 | 0 | 1 | Los Angeles County |
data.shape
(5000, 13)
data.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 5000 entries, 0 to 4999
Data columns (total 13 columns):
 #   Column              Non-Null Count  Dtype
---  ------              --------------  -----
 0   Age                 5000 non-null   int64
 1   Experience          5000 non-null   int64
 2   Income              5000 non-null   int64
 3   Family              5000 non-null   int64
 4   CCAvg               5000 non-null   float64
 5   Education           5000 non-null   int64
 6   Mortgage            5000 non-null   int64
 7   Personal_Loan       5000 non-null   int64
 8   Securities_Account  5000 non-null   int64
 9   CD_Account          5000 non-null   int64
 10  Online              5000 non-null   int64
 11  CreditCard          5000 non-null   int64
 12  County              5000 non-null   object
dtypes: float64(1), int64(11), object(1)
memory usage: 507.9+ KB
data.duplicated().sum()
0
data.describe()
|   | Age | Experience | Income | Family | CCAvg | Education | Mortgage | Personal_Loan | Securities_Account | CD_Account | Online | CreditCard |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| count | 5000.000000 | 5000.000000 | 5000.000000 | 5000.000000 | 5000.000000 | 5000.000000 | 5000.000000 | 5000.000000 | 5000.000000 | 5000.00000 | 5000.000000 | 5000.000000 |
| mean | 45.338400 | 20.104600 | 73.774200 | 2.396400 | 1.937938 | 1.881000 | 56.498800 | 0.096000 | 0.104400 | 0.06040 | 0.596800 | 0.294000 |
| std | 11.463166 | 11.467954 | 46.033729 | 1.147663 | 1.747659 | 0.839869 | 101.713802 | 0.294621 | 0.305809 | 0.23825 | 0.490589 | 0.455637 |
| min | 23.000000 | -3.000000 | 8.000000 | 1.000000 | 0.000000 | 1.000000 | 0.000000 | 0.000000 | 0.000000 | 0.00000 | 0.000000 | 0.000000 |
| 25% | 35.000000 | 10.000000 | 39.000000 | 1.000000 | 0.700000 | 1.000000 | 0.000000 | 0.000000 | 0.000000 | 0.00000 | 0.000000 | 0.000000 |
| 50% | 45.000000 | 20.000000 | 64.000000 | 2.000000 | 1.500000 | 2.000000 | 0.000000 | 0.000000 | 0.000000 | 0.00000 | 1.000000 | 0.000000 |
| 75% | 55.000000 | 30.000000 | 98.000000 | 3.000000 | 2.500000 | 3.000000 | 101.000000 | 0.000000 | 0.000000 | 0.00000 | 1.000000 | 1.000000 |
| max | 67.000000 | 43.000000 | 224.000000 | 4.000000 | 10.000000 | 3.000000 | 635.000000 | 1.000000 | 1.000000 | 1.00000 | 1.000000 | 1.000000 |
data.isna().sum()
Age                   0
Experience            0
Income                0
Family                0
CCAvg                 0
Education             0
Mortgage              0
Personal_Loan         0
Securities_Account    0
CD_Account            0
Online                0
CreditCard            0
County                0
dtype: int64
def generate_plot(data, feature, figsize=(10, 6), kde=True, bins=None):
    """
    Description:
        Generate both a boxplot and a histogram for a numerical variable.
    Inputs:
        data: dataframe of the dataset
        feature: dataframe column
        figsize: size of figure (default (10, 6))
        kde: whether to show the density curve (default True)
        bins: number of bins for the histogram (default None)
    Output:
        Boxplot and histogram
    """
    f2, (ax_box2, ax_hist2) = plt.subplots(
        nrows=2,
        sharex=True,
        gridspec_kw={"height_ratios": (0.25, 0.75)},
        figsize=figsize,
    )
    # Boxplot with the mean marked
    sns.boxplot(data=data, x=feature, ax=ax_box2, showmeans=True, color="violet")
    # Histogram (no palette: seaborn ignores a palette without a hue variable)
    if bins:
        sns.histplot(data=data, x=feature, kde=kde, ax=ax_hist2, bins=bins)
    else:
        sns.histplot(data=data, x=feature, kde=kde, ax=ax_hist2)
    # Mark the mean on the histogram
    ax_hist2.axvline(data[feature].mean(), color="green", linestyle="--")
    # Mark the median on the histogram
    ax_hist2.axvline(data[feature].median(), color="black", linestyle="-")
generate_plot(data, "Age")
generate_plot(data, "Experience")
generate_plot(data, "Income")
generate_plot(data, "Family")
generate_plot(data, "CCAvg")
generate_plot(data, "Education")
generate_plot(data, "Mortgage")
def count_statistic(dataframe, feature):
    """
    Description:
        Count the values of each category in a variable and report the
        proportion of each category.
    Inputs:
        dataframe - the dataset
        feature - the column name
    Output:
        Count and proportion of each category
    """
    count_values = dataframe[feature].value_counts()
    print("Counting:")
    print(count_values)
    print("\n")
    print("Population proportion:")
    print(count_values / count_values.sum())
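The same counts and proportions are also available directly from pandas via `value_counts(normalize=True)`; a minimal sketch on a toy Series standing in for a column like `Personal_Loan`:

```python
import pandas as pd

s = pd.Series([0, 0, 0, 1, 1])  # toy stand-in for a binary column

counts = s.value_counts()                     # absolute count per category
proportions = s.value_counts(normalize=True)  # same ranking, as fractions

print(counts.to_dict())       # counts per category
print(proportions.to_dict())  # proportions per category
```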
def generate_countplot(data, feature):
    """
    Description:
        Draw a count plot for a categorical variable.
    Inputs:
        data - the dataset
        feature - the column name
    Output:
        The count plot
    """
    sns.countplot(data=data, x=feature)
count_statistic(data, "Personal_Loan")
Counting:
0    4520
1     480
Name: Personal_Loan, dtype: int64


Population proportion:
0    0.904
1    0.096
Name: Personal_Loan, dtype: float64
generate_countplot(data, "Personal_Loan")
count_statistic(data, "Securities_Account")
Counting:
0    4478
1     522
Name: Securities_Account, dtype: int64


Population proportion:
0    0.8956
1    0.1044
Name: Securities_Account, dtype: float64
generate_countplot(data, "Securities_Account")
count_statistic(data, "CD_Account")
Counting:
0    4698
1     302
Name: CD_Account, dtype: int64


Population proportion:
0    0.9396
1    0.0604
Name: CD_Account, dtype: float64
generate_countplot(data, "CD_Account")
count_statistic(data, "Online")
Counting:
1    2984
0    2016
Name: Online, dtype: int64


Population proportion:
1    0.5968
0    0.4032
Name: Online, dtype: float64
generate_countplot(data, "Online")
count_statistic(data, "CreditCard")
Counting:
0    3530
1    1470
Name: CreditCard, dtype: int64


Population proportion:
0    0.706
1    0.294
Name: CreditCard, dtype: float64
generate_countplot(data, "CreditCard")
count_statistic(data, "County")
Counting:
Los Angeles County        1095
San Diego County           568
Santa Clara County         563
Alameda County             500
Orange County              339
San Francisco County       257
San Mateo County           204
Sacramento County          184
Santa Barbara County       154
Yolo County                130
Monterey County            128
Ventura County             114
San Bernardino County      101
Contra Costa County         85
Santa Cruz County           68
Riverside County            56
Marin County                54
Kern County                 54
Solano County               33
San Luis Obispo County      33
Humboldt County             32
Sonoma County               28
Fresno County               26
Placer County               24
92717                       22
Butte County                19
Shasta County               18
El Dorado County            17
Stanislaus County           15
San Benito County           14
San Joaquin County          13
Mendocino County             8
Tuolumne County              7
Siskiyou County              7
96651                        6
92634                        5
Lake County                  4
Merced County                4
Trinity County               4
Napa County                  3
Imperial County              3
93077                        1
Name: County, dtype: int64


Population proportion:
Los Angeles County        0.2190
San Diego County          0.1136
Santa Clara County        0.1126
Alameda County            0.1000
Orange County             0.0678
San Francisco County      0.0514
San Mateo County          0.0408
Sacramento County         0.0368
Santa Barbara County      0.0308
Yolo County               0.0260
Monterey County           0.0256
Ventura County            0.0228
San Bernardino County     0.0202
Contra Costa County       0.0170
Santa Cruz County         0.0136
Riverside County          0.0112
Marin County              0.0108
Kern County               0.0108
Solano County             0.0066
San Luis Obispo County    0.0066
Humboldt County           0.0064
Sonoma County             0.0056
Fresno County             0.0052
Placer County             0.0048
92717                     0.0044
Butte County              0.0038
Shasta County             0.0036
El Dorado County          0.0034
Stanislaus County         0.0030
San Benito County         0.0028
San Joaquin County        0.0026
Mendocino County          0.0016
Tuolumne County           0.0014
Siskiyou County           0.0014
96651                     0.0012
92634                     0.0010
Lake County               0.0008
Merced County             0.0008
Trinity County            0.0008
Napa County               0.0006
Imperial County           0.0006
93077                     0.0002
Name: County, dtype: float64
plt.rc("figure", figsize=[20, 20])
sns.countplot(data=data, y="County")
<AxesSubplot:xlabel='count', ylabel='County'>
sns.pairplot(data, diag_kind="kde")
<seaborn.axisgrid.PairGrid at 0x1d2314aed30>
# 2-D matrix:
correlation = data.corr()
correlation
|   | Age | Experience | Income | Family | CCAvg | Education | Mortgage | Personal_Loan | Securities_Account | CD_Account | Online | CreditCard |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Age | 1.000000 | 0.994215 | -0.055269 | -0.046418 | -0.052012 | 0.041334 | -0.012539 | -0.007726 | -0.000436 | 0.008043 | 0.013702 | 0.007681 |
| Experience | 0.994215 | 1.000000 | -0.046574 | -0.052563 | -0.050077 | 0.013152 | -0.010582 | -0.007413 | -0.001232 | 0.010353 | 0.013898 | 0.008967 |
| Income | -0.055269 | -0.046574 | 1.000000 | -0.157501 | 0.645984 | -0.187524 | 0.206806 | 0.502462 | -0.002616 | 0.169738 | 0.014206 | -0.002385 |
| Family | -0.046418 | -0.052563 | -0.157501 | 1.000000 | -0.109275 | 0.064929 | -0.020445 | 0.061367 | 0.019994 | 0.014110 | 0.010354 | 0.011588 |
| CCAvg | -0.052012 | -0.050077 | 0.645984 | -0.109275 | 1.000000 | -0.136124 | 0.109905 | 0.366889 | 0.015086 | 0.136534 | -0.003611 | -0.006689 |
| Education | 0.041334 | 0.013152 | -0.187524 | 0.064929 | -0.136124 | 1.000000 | -0.033327 | 0.136722 | -0.010812 | 0.013934 | -0.015004 | -0.011014 |
| Mortgage | -0.012539 | -0.010582 | 0.206806 | -0.020445 | 0.109905 | -0.033327 | 1.000000 | 0.142095 | -0.005411 | 0.089311 | -0.005995 | -0.007231 |
| Personal_Loan | -0.007726 | -0.007413 | 0.502462 | 0.061367 | 0.366889 | 0.136722 | 0.142095 | 1.000000 | 0.021954 | 0.316355 | 0.006278 | 0.002802 |
| Securities_Account | -0.000436 | -0.001232 | -0.002616 | 0.019994 | 0.015086 | -0.010812 | -0.005411 | 0.021954 | 1.000000 | 0.317034 | 0.012627 | -0.015028 |
| CD_Account | 0.008043 | 0.010353 | 0.169738 | 0.014110 | 0.136534 | 0.013934 | 0.089311 | 0.316355 | 0.317034 | 1.000000 | 0.175880 | 0.278644 |
| Online | 0.013702 | 0.013898 | 0.014206 | 0.010354 | -0.003611 | -0.015004 | -0.005995 | 0.006278 | 0.012627 | 0.175880 | 1.000000 | 0.004210 |
| CreditCard | 0.007681 | 0.008967 | -0.002385 | 0.011588 | -0.006689 | -0.011014 | -0.007231 | 0.002802 | -0.015028 | 0.278644 | 0.004210 | 1.000000 |
# heatmap:
plt.rc("figure", figsize=[15, 6])
sns.heatmap(correlation, annot=True, vmin=-1, vmax=1, fmt=".2f", cmap="Spectral")
<AxesSubplot:>
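The 0.994 correlation between Age and Experience in the matrix above is close to the maximum possible value of 1, which a perfectly linear pair attains. A minimal sketch with a toy pair, assuming the illustrative rule `experience = age - 24`:

```python
import numpy as np

age = np.array([25, 35, 45, 55, 65])
experience = age - 24  # hypothetical perfect linear relationship

# np.corrcoef returns the 2x2 correlation matrix; the off-diagonal entry
# is the Pearson correlation between the two arrays.
r = np.corrcoef(age, experience)[0, 1]
print(round(r, 4))
```

In the real data the relationship is nearly, but not exactly, linear, which is why the observed value is 0.994 rather than 1.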
# Create a function to do stacked plot:
def stacked_barplot(data, predictor, target):
    """
    Print the category counts and plot a stacked bar chart
    data: dataframe
    predictor: independent variable
    target: target variable
    """
    # Sort by the rarer target class (here, the loan acceptors)
    sorter = data[target].value_counts().index[-1]
    tab1 = pd.crosstab(data[predictor], data[target], margins=True).sort_values(
        by=sorter, ascending=False
    )
    print(tab1)
    print("-" * 120)
    tab = pd.crosstab(data[predictor], data[target], normalize="index").sort_values(
        by=sorter, ascending=False
    )
    tab.plot(kind="bar", stacked=True, figsize=(20, 6))
    plt.legend(loc="upper left", bbox_to_anchor=(1, 1), frameon=False)
    plt.show()
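The core of the function is `pd.crosstab` with `normalize="index"`, which converts each predictor row into within-row proportions so every stacked bar has total height 1. A minimal sketch on toy data:

```python
import pandas as pd

df = pd.DataFrame(
    {
        "Education": [1, 1, 2, 2, 2, 3],
        "Personal_Loan": [0, 0, 0, 1, 1, 1],
    }
)

# normalize="index" divides each row of the crosstab by its row total,
# which is exactly what the stacked bars display.
tab = pd.crosstab(df["Education"], df["Personal_Loan"], normalize="index")
print(tab)
```

Here Education level 1 maps entirely to class 0, level 3 entirely to class 1, and level 2 splits 1/3 vs. 2/3.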
stacked_barplot(data, "Age", "Personal_Loan")
Personal_Loan     0    1   All
Age
All            4520  480  5000
34              116   18   134
30              119   17   136
36               91   16   107
63               92   16   108
35              135   16   151
33              105   15   120
52              130   15   145
29              108   15   123
54              128   15   143
43              134   15   149
42              112   14   126
56              121   14   135
65               66   14    80
44              107   14   121
50              125   13   138
45              114   13   127
46              114   13   127
26               65   13    78
32              108   12   120
57              120   12   132
38              103   12   115
27               79   12    91
48              106   12   118
61              110   12   122
53              101   11   112
51              119   10   129
60              117   10   127
58              133   10   143
49              105   10   115
47              103   10   113
59              123    9   132
28               94    9   103
62              114    9   123
55              116    9   125
64               70    8    78
41              128    8   136
40              117    8   125
37               98    8   106
31              118    7   125
39              127    6   133
24               28    0    28
25               53    0    53
66               24    0    24
67               12    0    12
23               12    0    12
------------------------------------------------------------------------------------------------------------------------
sns.pointplot(x="Age", y="Personal_Loan", data=data, estimator=sum, ci=None)
<AxesSubplot:xlabel='Age', ylabel='Personal_Loan'>
stacked_barplot(data, "Experience", "Personal_Loan")
Personal_Loan     0    1   All
Experience
All            4520  480  5000
9               127   20   147
8               101   18   119
20              131   17   148
3               112   17   129
12               86   16   102
32              140   14   154
19              121   14   135
5               132   14   146
25              128   14   142
26              120   14   134
37              103   13   116
11              103   13   116
16              114   13   127
30              113   13   126
22              111   13   124
35              130   13   143
23              131   13   144
36              102   12   114
29              112   12   124
7               109   12   121
6               107   12   119
18              125   12   137
31               92   12   104
28              127   11   138
21              102   11   113
13              106   11   117
17              114   11   125
34              115   10   125
39               75   10    85
27              115   10   125
4               104    9   113
2                76    9    85
24              123    8   131
1                66    8    74
38               80    8    88
10              111    7   118
33              110    7   117
0                59    7    66
41               36    7    43
14              121    6   127
15              114    5   119
40               53    4    57
42                8    0     8
43                3    0     3
-2               15    0    15
-1               33    0    33
-3                4    0     4
------------------------------------------------------------------------------------------------------------------------
sns.pointplot(x="Experience", y="Personal_Loan", data=data, estimator=sum, ci=None)
<AxesSubplot:xlabel='Experience', ylabel='Personal_Loan'>
stacked_barplot(data, "Income", "Personal_Loan")
Personal_Loan 0 1 All Income All 4520 480 5000 130 8 11 19 182 2 11 13 158 8 10 18 135 8 10 18 179 8 9 17 141 15 9 24 154 12 9 21 123 9 9 18 184 3 9 12 142 7 8 15 131 11 8 19 129 15 8 23 172 3 8 11 173 5 8 13 170 4 8 12 180 10 8 18 115 19 8 27 125 16 7 23 164 6 7 13 188 3 7 10 83 67 7 74 114 23 7 30 161 9 7 16 122 17 7 24 133 8 7 15 132 11 7 18 191 6 7 13 134 13 7 20 111 15 7 22 190 4 7 11 145 17 6 23 140 13 6 19 178 4 6 10 118 13 6 19 185 3 6 9 165 5 6 11 168 2 6 8 169 1 6 7 183 6 6 12 120 11 6 17 139 10 6 16 113 29 5 34 119 13 5 18 99 19 5 24 138 13 5 18 155 14 5 19 195 10 5 15 174 4 5 9 175 7 5 12 152 10 5 15 153 7 4 11 181 4 4 8 103 14 4 18 93 33 4 37 108 12 4 16 101 20 4 24 194 4 4 8 192 2 4 6 193 2 4 6 143 5 4 9 149 16 4 20 171 5 4 9 160 8 4 12 159 3 4 7 128 20 4 24 148 7 4 11 162 7 3 10 112 23 3 26 110 16 3 19 124 9 3 12 105 17 3 20 104 17 3 20 102 13 3 16 109 15 3 18 95 22 3 25 150 9 2 11 94 24 2 26 163 7 2 9 91 35 2 37 98 26 2 28 89 32 2 34 121 18 2 20 85 63 2 65 144 5 2 7 65 59 1 60 71 42 1 43 69 45 1 46 100 9 1 10 60 51 1 52 189 1 1 2 73 43 1 44 151 3 1 4 201 4 1 5 64 59 1 60 202 1 1 2 81 82 1 83 92 28 1 29 90 37 1 38 75 46 1 47 82 60 1 61 84 62 1 63 203 1 1 2 33 51 0 51 198 3 0 3 31 55 0 55 30 63 0 63 29 67 0 67 28 63 0 63 25 64 0 64 24 47 0 47 23 54 0 54 22 65 0 65 224 1 0 1 21 65 0 65 20 47 0 47 19 52 0 52 200 3 0 3 218 1 0 1 18 53 0 53 204 3 0 3 15 33 0 33 199 3 0 3 14 31 0 31 13 32 0 32 12 30 0 30 205 2 0 2 11 27 0 27 10 23 0 23 32 58 0 58 48 44 0 44 34 53 0 53 35 65 0 65 9 26 0 26 88 26 0 26 80 56 0 56 79 53 0 53 78 61 0 61 74 45 0 45 72 41 0 41 70 47 0 47 68 35 0 35 63 46 0 46 62 55 0 55 61 57 0 57 59 53 0 53 58 55 0 55 55 61 0 61 54 52 0 52 53 57 0 57 52 47 0 47 51 41 0 41 50 45 0 45 49 52 0 52 45 69 0 69 44 85 0 85 43 70 0 70 42 77 0 77 41 82 0 82 40 78 0 78 39 81 0 81 38 84 0 84 8 23 0 23 ------------------------------------------------------------------------------------------------------------------------
sns.pointplot(x="Income", y="Personal_Loan", data=data, estimator=sum, ci=None)
<AxesSubplot:xlabel='Income', ylabel='Personal_Loan'>
sns.boxplot(data=data, x="Personal_Loan", y="Income")
<AxesSubplot:xlabel='Personal_Loan', ylabel='Income'>
stacked_barplot(data, "County", "Personal_Loan")
Personal_Loan              0    1   All
County
All                     4520  480  5000
Los Angeles County       984  111  1095
Santa Clara County       492   71   563
San Diego County         509   59   568
Alameda County           456   44   500
Orange County            309   30   339
San Francisco County     238   19   257
Monterey County          113   15   128
Sacramento County        169   15   184
Contra Costa County       73   12    85
San Mateo County         192   12   204
Ventura County           103   11   114
Santa Barbara County     143   11   154
Santa Cruz County         60    8    68
Yolo County              122    8   130
Kern County               47    7    54
Sonoma County             22    6    28
Riverside County          50    6    56
Marin County              48    6    54
San Luis Obispo County    28    5    33
Solano County             30    3    33
Shasta County             15    3    18
92717                     19    3    22
San Bernardino County     98    3   101
Butte County              17    2    19
Fresno County             24    2    26
Humboldt County           30    2    32
Placer County             22    2    24
Mendocino County           7    1     8
El Dorado County          16    1    17
Stanislaus County         14    1    15
San Joaquin County        12    1    13
96651                      6    0     6
93077                      1    0     1
Tuolumne County            7    0     7
Trinity County             4    0     4
Napa County                3    0     3
Siskiyou County            7    0     7
Merced County              4    0     4
Imperial County            3    0     3
Lake County                4    0     4
San Benito County         14    0    14
92634                      5    0     5
------------------------------------------------------------------------------------------------------------------------
plt.rc("figure", figsize=[20, 20])
sns.pointplot(y="County", x="Personal_Loan", data=data, estimator=sum, ci=None)
<AxesSubplot:xlabel='Personal_Loan', ylabel='County'>
stacked_barplot(data, "Family", "Personal_Loan")
Personal_Loan     0    1   All
Family
All            4520  480  5000
4              1088  134  1222
3               877  133  1010
1              1365  107  1472
2              1190  106  1296
------------------------------------------------------------------------------------------------------------------------
plt.rc("figure", figsize=[10, 6])
sns.boxplot(data=data, x="Personal_Loan", y="Family")
<AxesSubplot:xlabel='Personal_Loan', ylabel='Family'>
stacked_barplot(data, "CCAvg", "Personal_Loan")
Personal_Loan 0 1 All CCAvg All 4520 480 5000 3.0 34 19 53 4.1 9 13 22 3.4 26 13 39 3.1 8 12 20 4.2 0 11 11 5.4 8 10 18 6.5 8 10 18 3.8 33 10 43 3.6 17 10 27 3.3 35 10 45 5.0 9 9 18 3.9 18 9 27 2.9 45 9 54 2.6 79 8 87 6.0 18 8 26 4.4 9 8 17 4.3 18 8 26 0.2 196 8 204 0.5 155 8 163 4.7 17 7 24 5.2 9 7 16 1.3 121 7 128 2.7 51 7 58 3.7 18 7 25 1.1 77 7 84 5.6 0 7 7 4.0 26 7 33 2.2 123 7 130 4.8 0 7 7 5.1 0 6 6 0.7 163 6 169 6.1 8 6 14 1.2 60 6 66 3.5 9 6 15 4.6 8 6 14 0.3 235 6 241 0.8 182 5 187 6.9 9 5 14 4.9 17 5 22 6.3 8 5 13 3.2 17 5 22 2.3 53 5 58 1.4 131 5 136 2.8 105 5 110 7.0 9 5 14 5.7 8 5 13 2.4 87 5 92 5.9 0 5 5 1.9 102 4 106 1.7 154 4 158 7.9 0 4 4 4.5 25 4 29 0.6 114 4 118 5.5 0 4 4 2.0 184 4 188 0.4 175 4 179 5.3 0 4 4 7.4 9 4 13 1.5 174 4 178 7.2 9 4 13 6.6 0 4 4 1.6 98 3 101 10.0 0 3 3 6.4 0 3 3 0.9 103 3 106 8.0 9 3 12 7.5 9 3 12 2.1 97 3 100 1.8 149 3 152 5.8 0 3 3 6.2 0 2 2 9.0 0 2 2 8.5 0 2 2 6.8 8 2 10 8.3 0 2 2 4.25 0 2 2 5.67 0 2 2 4.75 0 2 2 0.1 181 2 183 1.0 229 2 231 2.5 105 2 107 7.3 9 1 10 9.3 0 1 1 8.9 0 1 1 8.8 8 1 9 8.2 0 1 1 8.1 9 1 10 5.33 0 1 1 0.0 105 1 106 3.67 0 1 1 3.25 0 1 1 3.33 0 1 1 4.67 0 1 1 6.33 9 1 10 2.75 0 1 1 2.67 36 0 36 0.67 18 0 18 0.75 9 0 9 4.33 9 0 9 8.6 8 0 8 6.67 9 0 9 1.33 9 0 9 6.7 9 0 9 1.67 18 0 18 1.75 9 0 9 7.8 9 0 9 7.6 9 0 9 2.33 18 0 18 ------------------------------------------------------------------------------------------------------------------------
plt.rc("figure", figsize=[30, 6])
sns.pointplot(x="CCAvg", y="Personal_Loan", data=data, estimator=sum, ci=None)
<AxesSubplot:xlabel='CCAvg', ylabel='Personal_Loan'>
sns.boxplot(data=data, x="Personal_Loan", y="CCAvg")
<AxesSubplot:xlabel='Personal_Loan', ylabel='CCAvg'>
stacked_barplot(data, "Education", "Personal_Loan")
Personal_Loan     0    1   All
Education
All            4520  480  5000
3              1296  205  1501
2              1221  182  1403
1              2003   93  2096
------------------------------------------------------------------------------------------------------------------------
stacked_barplot(data, "Mortgage", "Personal_Loan")
Personal_Loan     0    1   All
Mortgage
All            4520  480  5000
0              3150  312  3462
301               0    5     5
342               1    3     4
282               0    3     3
...             ...  ...   ...
276               2    0     2
156               5    0     5
278               1    0     1
280               2    0     2
248               3    0     3

[348 rows x 3 columns]
------------------------------------------------------------------------------------------------------------------------
sns.boxplot(data=data, x="Personal_Loan", y="Mortgage")
<AxesSubplot:xlabel='Personal_Loan', ylabel='Mortgage'>
stacked_barplot(data, "Securities_Account", "Personal_Loan")
Personal_Loan          0    1   All
Securities_Account
All                 4520  480  5000
0                   4058  420  4478
1                    462   60   522
------------------------------------------------------------------------------------------------------------------------
stacked_barplot(data, "CD_Account", "Personal_Loan")
Personal_Loan     0    1   All
CD_Account
All            4520  480  5000
0              4358  340  4698
1               162  140   302
------------------------------------------------------------------------------------------------------------------------
stacked_barplot(data, "Online", "Personal_Loan")
Personal_Loan     0    1   All
Online
All            4520  480  5000
1              2693  291  2984
0              1827  189  2016
------------------------------------------------------------------------------------------------------------------------
stacked_barplot(data, "CreditCard", "Personal_Loan")
Personal_Loan     0    1   All
CreditCard
All            4520  480  5000
0              3193  337  3530
1              1327  143  1470
------------------------------------------------------------------------------------------------------------------------
Data Description:
Univariate Data Analysis:
- Age: no outliers; the distribution is roughly symmetric.
- Experience: no outliers; the distribution is roughly symmetric.
- Income: there are outliers, and the distribution is right-skewed.
- County: most customers live in Los Angeles County (1,095 people, 21.9%); the second most common is San Diego County (568 people, 11.36%).
- Family: no outliers in the Family variable.
- CCAvg: there are outliers, and the distribution is right-skewed.
- Education: no outliers in the Education variable.
- Mortgage: there are outliers, and the distribution is right-skewed.
- Personal_Loan: about 4,520 customers (90.4%) did not accept the personal loan offered during the last campaign; only 480 did.
- Securities_Account: 4,478 customers (89.56%) do not have a securities account with the bank, while 522 do.
- CD_Account: 4,698 customers (93.96%) do not have a CD account, while 302 do.
- Online: 2,984 customers (59.68%) use online banking, while 2,016 do not.
- CreditCard: 3,530 customers (70.6%) do not have a credit card issued by another bank, while 1,470 do.

Bivariate Data Analysis:
def find_z_score(data, feature, threshold=3):
    """
    Description:
        Detect the outliers in a numerical variable using z-scores.
    Inputs:
        data - the dataset
        feature - column name
        threshold - default 3, because points more than 3 standard deviations
                    from the mean are treated as outliers
    Output:
        List of outlier values in the variable
    """
    outlier = []  # local list, so counts do not accumulate across calls
    mean = np.mean(data[feature])
    std = np.std(data[feature])
    for value in data[feature]:
        z_score = (value - mean) / std
        # use the absolute z-score so both tails are detected
        if np.abs(z_score) > threshold:
            outlier.append(value)
    return outlier
target_columns = ["Income", "CCAvg", "Mortgage"]
# Detect the number of outliers in each target variable:
for column in target_columns:
    outliers = find_z_score(data, column)
    print("There are ", len(outliers), " outliers in ", column, " variable")
    print("-" * 20)
There are  2  outliers in  Income  variable
--------------------
There are  123  outliers in  CCAvg  variable
--------------------
There are  228  outliers in  Mortgage  variable
--------------------
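The z-score loop above can also be written in vectorized form with NumPy, which flags the same points without iterating row by row. A minimal sketch on toy data (the threshold of 1.5 is used only because the toy sample is tiny; the function above defaults to 3):

```python
import numpy as np

values = np.array([1.0, 2.0, 1.5, 2.5, 100.0])  # one obvious outlier

# Vectorized z-scores: (x - mean) / std, with the population std as in np.std.
z = (values - values.mean()) / values.std()

# Points whose absolute z-score exceeds the threshold are flagged as outliers.
outliers = values[np.abs(z) > 1.5]
print(outliers)
```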
def IQR_method(data, feature):
    """
    Description:
        - Uses the interquartile range (IQR) method for outlier treatment.
        - Q1 is the 25th percentile, Q3 is the 75th percentile, and IQR = Q3 - Q1.
        - Any data points outside the range [Q1 - 1.5*IQR, Q3 + 1.5*IQR] are outliers.
        - Points below the lower bound are replaced with the lower bound.
        - Points above the upper bound are replaced with the upper bound.
    Inputs:
        data - the dataset
        feature - column name
    Output:
        Updated values for outliers
    """
    Q1 = data[feature].quantile(0.25)
    Q3 = data[feature].quantile(0.75)
    IQR = Q3 - Q1
    lower_range = Q1 - 1.5 * IQR
    upper_range = Q3 + 1.5 * IQR
    # Replace outliers with the lower and upper range values:
    data[feature] = np.where(data[feature] < lower_range, lower_range, data[feature])
    data[feature] = np.where(data[feature] > upper_range, upper_range, data[feature])
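The two `np.where` assignments amount to capping the column at the whisker bounds, which pandas can do in one call with `Series.clip`. A minimal sketch on toy data:

```python
import pandas as pd

s = pd.Series([1, 2, 3, 4, 100])  # 100 sits far outside the whiskers

q1, q3 = s.quantile(0.25), s.quantile(0.75)
iqr = q3 - q1
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr

# clip() caps values at the whisker bounds in one call, the same effect as
# the two np.where() assignments in IQR_method above.
capped = s.clip(lower=lower, upper=upper)
print(capped.tolist())
```

For this toy series the bounds come out to [-1, 7], so only the 100 is pulled down, to 7.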
# Outlier treatment for target variables:
for column in target_columns:
    IQR_method(data, column)
# Plot the treated variables to check that the outlier treatment worked:
for column in target_columns:
    generate_plot(data, column)
    plt.show()
data[data["Experience"] < 0]
|   | Age | Experience | Income | Family | CCAvg | Education | Mortgage | Personal_Loan | Securities_Account | CD_Account | Online | CreditCard | County |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 89 | 25 | -1 | 113.0 | 4 | 2.30 | 3 | 0.0 | 0 | 0 | 0 | 0 | 1 | San Mateo County |
| 226 | 24 | -1 | 39.0 | 2 | 1.70 | 2 | 0.0 | 0 | 0 | 0 | 0 | 0 | Santa Clara County |
| 315 | 24 | -2 | 51.0 | 3 | 0.30 | 3 | 0.0 | 0 | 0 | 0 | 1 | 0 | Orange County |
| 451 | 28 | -2 | 48.0 | 2 | 1.75 | 3 | 89.0 | 0 | 0 | 0 | 1 | 0 | San Francisco County |
| 524 | 24 | -1 | 75.0 | 4 | 0.20 | 1 | 0.0 | 0 | 0 | 0 | 1 | 0 | Santa Barbara County |
| 536 | 25 | -1 | 43.0 | 3 | 2.40 | 2 | 176.0 | 0 | 0 | 0 | 1 | 0 | San Diego County |
| 540 | 25 | -1 | 109.0 | 4 | 2.30 | 3 | 252.5 | 0 | 0 | 0 | 1 | 0 | San Mateo County |
| 576 | 25 | -1 | 48.0 | 3 | 0.30 | 3 | 0.0 | 0 | 0 | 0 | 0 | 1 | Orange County |
| 583 | 24 | -1 | 38.0 | 2 | 1.70 | 2 | 0.0 | 0 | 0 | 0 | 1 | 0 | San Benito County |
| 597 | 24 | -2 | 125.0 | 2 | 5.20 | 1 | 0.0 | 0 | 1 | 0 | 0 | 1 | Orange County |
| 649 | 25 | -1 | 82.0 | 4 | 2.10 | 3 | 0.0 | 0 | 0 | 0 | 1 | 0 | Orange County |
| 670 | 23 | -1 | 61.0 | 4 | 2.60 | 1 | 239.0 | 0 | 0 | 0 | 1 | 0 | San Bernardino County |
| 686 | 24 | -1 | 38.0 | 4 | 0.60 | 2 | 0.0 | 0 | 0 | 0 | 1 | 0 | Orange County |
| 793 | 24 | -2 | 150.0 | 2 | 2.00 | 1 | 0.0 | 0 | 0 | 0 | 1 | 0 | Alameda County |
| 889 | 24 | -2 | 82.0 | 2 | 1.60 | 3 | 0.0 | 0 | 0 | 0 | 1 | 1 | Los Angeles County |
| 909 | 23 | -1 | 149.0 | 1 | 5.20 | 1 | 252.5 | 0 | 0 | 0 | 0 | 1 | San Bernardino County |
| 1173 | 24 | -1 | 35.0 | 2 | 1.70 | 2 | 0.0 | 0 | 0 | 0 | 0 | 0 | Santa Clara County |
| 1428 | 25 | -1 | 21.0 | 4 | 0.40 | 1 | 90.0 | 0 | 0 | 0 | 1 | 0 | Contra Costa County |
| 1522 | 25 | -1 | 101.0 | 4 | 2.30 | 3 | 252.5 | 0 | 0 | 0 | 0 | 1 | Alameda County |
| 1905 | 25 | -1 | 112.0 | 2 | 2.00 | 1 | 241.0 | 0 | 0 | 0 | 1 | 0 | Riverside County |
| 2102 | 25 | -1 | 81.0 | 2 | 1.60 | 3 | 0.0 | 0 | 0 | 0 | 1 | 1 | Orange County |
| 2430 | 23 | -1 | 73.0 | 4 | 2.60 | 1 | 0.0 | 0 | 0 | 0 | 1 | 0 | San Diego County |
| 2466 | 24 | -2 | 80.0 | 2 | 1.60 | 3 | 0.0 | 0 | 0 | 0 | 1 | 0 | San Francisco County |
| 2545 | 25 | -1 | 39.0 | 3 | 2.40 | 2 | 0.0 | 0 | 0 | 0 | 1 | 0 | Alameda County |
| 2618 | 23 | -3 | 55.0 | 3 | 2.40 | 2 | 145.0 | 0 | 0 | 0 | 1 | 0 | Orange County |
| 2717 | 23 | -2 | 45.0 | 4 | 0.60 | 2 | 0.0 | 0 | 0 | 0 | 1 | 1 | Lake County |
| 2848 | 24 | -1 | 78.0 | 2 | 1.80 | 2 | 0.0 | 0 | 0 | 0 | 0 | 0 | Alameda County |
| 2876 | 24 | -2 | 80.0 | 2 | 1.60 | 3 | 238.0 | 0 | 0 | 0 | 0 | 0 | Los Angeles County |
| 2962 | 23 | -2 | 81.0 | 2 | 1.80 | 2 | 0.0 | 0 | 0 | 0 | 0 | 0 | Los Angeles County |
| 2980 | 25 | -1 | 53.0 | 3 | 2.40 | 2 | 0.0 | 0 | 0 | 0 | 0 | 0 | Santa Clara County |
| 3076 | 29 | -1 | 62.0 | 2 | 1.75 | 3 | 0.0 | 0 | 0 | 0 | 0 | 1 | Orange County |
| 3130 | 23 | -2 | 82.0 | 2 | 1.80 | 2 | 0.0 | 0 | 1 | 0 | 0 | 1 | San Diego County |
| 3157 | 23 | -1 | 13.0 | 4 | 1.00 | 1 | 84.0 | 0 | 0 | 0 | 1 | 0 | Alameda County |
| 3279 | 26 | -1 | 44.0 | 1 | 2.00 | 2 | 0.0 | 0 | 0 | 0 | 0 | 0 | Marin County |
| 3284 | 25 | -1 | 101.0 | 4 | 2.10 | 3 | 0.0 | 0 | 0 | 0 | 0 | 1 | Sacramento County |
| 3292 | 25 | -1 | 13.0 | 4 | 0.40 | 1 | 0.0 | 0 | 1 | 0 | 0 | 0 | Yolo County |
| 3394 | 25 | -1 | 113.0 | 4 | 2.10 | 3 | 0.0 | 0 | 0 | 0 | 1 | 0 | Los Angeles County |
| 3425 | 23 | -1 | 12.0 | 4 | 1.00 | 1 | 90.0 | 0 | 0 | 0 | 1 | 0 | Los Angeles County |
| 3626 | 24 | -3 | 28.0 | 4 | 1.00 | 3 | 0.0 | 0 | 0 | 0 | 0 | 0 | Los Angeles County |
| 3796 | 24 | -2 | 50.0 | 3 | 2.40 | 2 | 0.0 | 0 | 1 | 0 | 0 | 0 | Marin County |
| 3824 | 23 | -1 | 12.0 | 4 | 1.00 | 1 | 0.0 | 0 | 1 | 0 | 0 | 1 | Santa Cruz County |
| 3887 | 24 | -2 | 118.0 | 2 | 5.20 | 1 | 0.0 | 0 | 1 | 0 | 1 | 0 | 92634 |
| 3946 | 25 | -1 | 40.0 | 3 | 2.40 | 2 | 0.0 | 0 | 0 | 0 | 1 | 0 | Santa Barbara County |
| 4015 | 25 | -1 | 139.0 | 2 | 2.00 | 1 | 0.0 | 0 | 0 | 0 | 0 | 1 | Santa Barbara County |
| 4088 | 29 | -1 | 71.0 | 2 | 1.75 | 3 | 0.0 | 0 | 0 | 0 | 0 | 0 | Contra Costa County |
| 4116 | 24 | -2 | 135.0 | 2 | 5.20 | 1 | 0.0 | 0 | 0 | 0 | 1 | 0 | Los Angeles County |
| 4285 | 23 | -3 | 149.0 | 2 | 5.20 | 1 | 0.0 | 0 | 0 | 0 | 1 | 0 | Kern County |
| 4411 | 23 | -2 | 75.0 | 2 | 1.80 | 2 | 0.0 | 0 | 0 | 0 | 1 | 1 | Los Angeles County |
| 4481 | 25 | -2 | 35.0 | 4 | 1.00 | 3 | 0.0 | 0 | 0 | 0 | 1 | 0 | San Benito County |
| 4514 | 24 | -3 | 41.0 | 4 | 1.00 | 3 | 0.0 | 0 | 0 | 0 | 1 | 0 | Los Angeles County |
| 4582 | 25 | -1 | 69.0 | 3 | 0.30 | 3 | 0.0 | 0 | 0 | 0 | 1 | 0 | Orange County |
| 4957 | 29 | -1 | 50.0 | 2 | 1.75 | 3 | 0.0 | 0 | 0 | 0 | 0 | 1 | Sacramento County |
# Replace the negative Experience values with the median Experience:
data["Experience"] = np.where(
    data["Experience"] < 0, data["Experience"].median(), data["Experience"]
)
data[data["Experience"] < 0]
| Age | Experience | Income | Family | CCAvg | Education | Mortgage | Personal_Loan | Securities_Account | CD_Account | Online | CreditCard | County |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
data["Personal_Loan"].value_counts()
0    4520
1     480
Name: Personal_Loan, dtype: int64
Outlier Treatment

The negative Experience values have been replaced with the median, so the check above now returns an empty dataframe.
# Personal_Loan is the target (dependent) variable
X = data.drop(["Personal_Loan"], axis=1)
y = data["Personal_Loan"]
y.value_counts()
0    4520
1     480
Name: Personal_Loan, dtype: int64
X = pd.get_dummies(
    X,
    columns=["County"],
    drop_first=True,
)
X.head()
| Age | Experience | Income | Family | CCAvg | Education | Mortgage | Securities_Account | CD_Account | Online | CreditCard | County_92717 | County_93077 | County_96651 | County_Alameda County | County_Butte County | County_Contra Costa County | County_El Dorado County | County_Fresno County | County_Humboldt County | County_Imperial County | County_Kern County | County_Lake County | County_Los Angeles County | County_Marin County | County_Mendocino County | County_Merced County | County_Monterey County | County_Napa County | County_Orange County | County_Placer County | County_Riverside County | County_Sacramento County | County_San Benito County | County_San Bernardino County | County_San Diego County | County_San Francisco County | County_San Joaquin County | County_San Luis Obispo County | County_San Mateo County | County_Santa Barbara County | County_Santa Clara County | County_Santa Cruz County | County_Shasta County | County_Siskiyou County | County_Solano County | County_Sonoma County | County_Stanislaus County | County_Trinity County | County_Tuolumne County | County_Ventura County | County_Yolo County | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 25 | 1.0 | 49.0 | 4 | 1.6 | 1 | 0.0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 1 | 45 | 19.0 | 34.0 | 3 | 1.5 | 1 | 0.0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 2 | 39 | 15.0 | 11.0 | 1 | 1.0 | 1 | 0.0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 3 | 35 | 9.0 | 100.0 | 1 | 2.7 | 2 | 0.0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 4 | 35 | 8.0 | 45.0 | 4 | 1.0 | 2 | 0.0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
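One-hot encoding with `drop_first=True` keeps k−1 dummies for a k-level categorical, which avoids perfect multicollinearity among the dummy columns. A minimal demonstration on a toy frame (the values are hypothetical):

```python
import pandas as pd

# drop_first=True drops the first (alphabetical) level, here "A",
# so its effect is absorbed into the model intercept
demo = pd.DataFrame({"County": ["A", "B", "C", "A"]})
cols = pd.get_dummies(demo, columns=["County"], drop_first=True).columns.tolist()
print(cols)  # → ['County_B', 'County_C']
```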
x_train, x_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=1)
x_train.shape
(3500, 52)
x_test.shape
(1500, 52)
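With only ~9.6% of customers accepting the loan, a plain random split can shift the positive rate between train and test; passing `stratify=y` to `train_test_split` preserves it in both sets. A minimal numpy sketch of the idea behind stratification (the helper name is hypothetical):

```python
import numpy as np

def stratified_split_indices(y, test_frac=0.3, seed=1):
    """Split indices so each class keeps the same proportion in train and test."""
    rng = np.random.default_rng(seed)
    train_idx, test_idx = [], []
    for cls in np.unique(y):
        idx = np.flatnonzero(y == cls)
        rng.shuffle(idx)
        n_test = int(round(test_frac * len(idx)))
        test_idx.extend(idx[:n_test])
        train_idx.extend(idx[n_test:])
    return np.array(train_idx), np.array(test_idx)

y = np.array([0] * 4520 + [1] * 480)  # same 9.6% positive rate as the data
train_idx, test_idx = stratified_split_indices(y)
print(y[train_idx].mean(), y[test_idx].mean())  # both ≈ 0.096
```

In practice, `train_test_split(X, y, test_size=0.3, random_state=1, stratify=y)` achieves the same effect.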
# There are several solvers available in sklearn's LogisticRegression;
# newton-cg is used here as it converges reliably on this dataset
logistic_regression = LogisticRegression(solver="newton-cg", random_state=1)
logistic_regression.fit(x_train, y_train)
LogisticRegression(random_state=1, solver='newton-cg')
pd.DataFrame(logistic_regression.coef_[0], x_train.columns, columns=["Coefficients"])
| Coefficients | |
|---|---|
| Age | 0.062762 |
| Experience | -0.056574 |
| Income | 0.055665 |
| Family | 0.780346 |
| CCAvg | 0.387583 |
| Education | 1.740822 |
| Mortgage | 0.001127 |
| Securities_Account | -0.821316 |
| CD_Account | 3.218479 |
| Online | -0.567922 |
| CreditCard | -0.962282 |
| County_92717 | 0.573386 |
| County_93077 | -0.001961 |
| County_96651 | -0.000634 |
| County_Alameda County | -0.138966 |
| County_Butte County | -0.255888 |
| County_Contra Costa County | 0.644567 |
| County_El Dorado County | -0.154299 |
| County_Fresno County | -0.027399 |
| County_Humboldt County | -0.322269 |
| County_Imperial County | -0.012620 |
| County_Kern County | 0.502465 |
| County_Lake County | -0.003957 |
| County_Los Angeles County | -0.006388 |
| County_Marin County | 0.405326 |
| County_Mendocino County | -0.048394 |
| County_Merced County | -0.230589 |
| County_Monterey County | -0.068182 |
| County_Napa County | -0.004825 |
| County_Orange County | -0.228930 |
| County_Placer County | 0.644699 |
| County_Riverside County | 1.003984 |
| County_Sacramento County | 0.059395 |
| County_San Benito County | -0.269374 |
| County_San Bernardino County | -0.894218 |
| County_San Diego County | 0.015193 |
| County_San Francisco County | 0.279827 |
| County_San Joaquin County | 0.009296 |
| County_San Luis Obispo County | -0.432049 |
| County_San Mateo County | -0.981119 |
| County_Santa Barbara County | 0.077643 |
| County_Santa Clara County | 0.146526 |
| County_Santa Cruz County | 0.144907 |
| County_Shasta County | -0.198765 |
| County_Siskiyou County | -0.022618 |
| County_Solano County | 0.203565 |
| County_Sonoma County | 0.453298 |
| County_Stanislaus County | -0.269774 |
| County_Trinity County | -0.095355 |
| County_Tuolumne County | -0.143039 |
| County_Ventura County | 0.138589 |
| County_Yolo County | -0.485701 |
# Convert the log-odds coefficients to odds:
odds = np.exp(logistic_regression.coef_[0])
odds
array([ 1.06477362, 0.94499699, 1.05724391, 2.18222814, 1.47341502,
5.70202788, 1.00112788, 0.43985242, 24.99008259, 0.56670176,
0.38202004, 1.77426544, 0.99804074, 0.99936623, 0.87025802,
0.77422898, 1.90516269, 0.85701573, 0.97297312, 0.72450311,
0.98745977, 1.65279084, 0.99605085, 0.99363243, 1.49979118,
0.95275873, 0.79406606, 0.93409036, 0.99518643, 0.79538411,
1.90541327, 2.72913369, 1.06119417, 0.76385785, 0.40892707,
1.01530938, 1.32290102, 1.00933944, 0.64917727, 0.37489134,
1.0807368 , 1.15780508, 1.15593237, 0.81974222, 0.97763631,
1.2257652 , 1.57349319, 0.7635519 , 0.90904993, 0.86671999,
1.14865162, 0.61526558])
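The exp transform works because logistic regression models the log-odds linearly: increasing a feature by one unit adds its coefficient to the log-odds, which multiplies the odds by exp(coefficient). A quick numeric check with a hypothetical intercept and coefficient:

```python
import numpy as np

b0, b1 = -2.0, 0.5  # hypothetical intercept and coefficient

def odds(x):
    # log-odds = b0 + b1 * x, so odds = exp(b0 + b1 * x)
    return np.exp(b0 + b1 * x)

# a one-unit increase in x multiplies the odds by exp(b1), regardless of x:
ratio = odds(3.0) / odds(2.0)
print(ratio, np.exp(b1))  # both ≈ 1.6487
```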
# Find the percentage change in odds:
odds_percent = (odds - 1) * 100
odds_percent
array([ 6.47736213e+00, -5.50030102e+00, 5.72439054e+00, 1.18222814e+02,
4.73415015e+01, 4.70202788e+02, 1.12787683e-01, -5.60147575e+01,
2.39900826e+03, -4.33298237e+01, -6.17979957e+01, 7.74265437e+01,
-1.95925877e-01, -6.33771625e-02, -1.29741978e+01, -2.25771019e+01,
9.05162689e+01, -1.42984275e+01, -2.70268791e+00, -2.75496886e+01,
-1.25402260e+00, 6.52790838e+01, -3.94915384e-01, -6.36757146e-01,
4.99791179e+01, -4.72412682e+00, -2.05933941e+01, -6.59096405e+00,
-4.81357285e-01, -2.04615886e+01, 9.05413270e+01, 1.72913369e+02,
6.11941658e+00, -2.36142149e+01, -5.91072929e+01, 1.53093787e+00,
3.22901019e+01, 9.33943718e-01, -3.50822728e+01, -6.25108663e+01,
8.07368002e+00, 1.57805076e+01, 1.55932374e+01, -1.80257784e+01,
-2.23636868e+00, 2.25765198e+01, 5.73493194e+01, -2.36448096e+01,
-9.09500700e+00, -1.33280015e+01, 1.48651619e+01, -3.84734418e+01])
# removing limit from number of columns to display
pd.set_option("display.max_columns", None)
# Create a dataframe:
pd.DataFrame(
    {"Odds": odds, "Odds_Percent_Change": odds_percent}, index=x_train.columns
).T
| Age | Experience | Income | Family | CCAvg | Education | Mortgage | Securities_Account | CD_Account | Online | CreditCard | County_92717 | County_93077 | County_96651 | County_Alameda County | County_Butte County | County_Contra Costa County | County_El Dorado County | County_Fresno County | County_Humboldt County | County_Imperial County | County_Kern County | County_Lake County | County_Los Angeles County | County_Marin County | County_Mendocino County | County_Merced County | County_Monterey County | County_Napa County | County_Orange County | County_Placer County | County_Riverside County | County_Sacramento County | County_San Benito County | County_San Bernardino County | County_San Diego County | County_San Francisco County | County_San Joaquin County | County_San Luis Obispo County | County_San Mateo County | County_Santa Barbara County | County_Santa Clara County | County_Santa Cruz County | County_Shasta County | County_Siskiyou County | County_Solano County | County_Sonoma County | County_Stanislaus County | County_Trinity County | County_Tuolumne County | County_Ventura County | County_Yolo County | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Odds | 1.064774 | 0.944997 | 1.057244 | 2.182228 | 1.473415 | 5.702028 | 1.001128 | 0.439852 | 24.990083 | 0.566702 | 0.382020 | 1.774265 | 0.998041 | 0.999366 | 0.870258 | 0.774229 | 1.905163 | 0.857016 | 0.972973 | 0.724503 | 0.987460 | 1.652791 | 0.996051 | 0.993632 | 1.499791 | 0.952759 | 0.794066 | 0.934090 | 0.995186 | 0.795384 | 1.905413 | 2.729134 | 1.061194 | 0.763858 | 0.408927 | 1.015309 | 1.322901 | 1.009339 | 0.649177 | 0.374891 | 1.080737 | 1.157805 | 1.155932 | 0.819742 | 0.977636 | 1.225765 | 1.573493 | 0.763552 | 0.909050 | 0.866720 | 1.148652 | 0.615266 |
| Odds_Percent_Change | 6.477362 | -5.500301 | 5.724391 | 118.222814 | 47.341502 | 470.202788 | 0.112788 | -56.014758 | 2399.008259 | -43.329824 | -61.797996 | 77.426544 | -0.195926 | -0.063377 | -12.974198 | -22.577102 | 90.516269 | -14.298427 | -2.702688 | -27.549689 | -1.254023 | 65.279084 | -0.394915 | -0.636757 | 49.979118 | -4.724127 | -20.593394 | -6.590964 | -0.481357 | -20.461589 | 90.541327 | 172.913369 | 6.119417 | -23.614215 | -59.107293 | 1.530938 | 32.290102 | 0.933944 | -35.082273 | -62.510866 | 8.073680 | 15.780508 | 15.593237 | -18.025778 | -2.236369 | 22.576520 | 57.349319 | -23.644810 | -9.095007 | -13.328001 | 14.865162 | -38.473442 |
Age: Holding all other features constant, a one-unit increase in Age multiplies the odds of a customer accepting the personal loan offer by 1.065, i.e. a 6.48% increase in the odds.
Experience: Holding all other features constant, a one-unit increase in Experience multiplies the odds by 0.945, i.e. a 5.50% decrease in the odds.
Income: Holding all other features constant, a one-unit (thousand-dollar) increase in Income multiplies the odds by 1.057, i.e. a 5.72% increase in the odds.
Family: Holding all other features constant, a one-unit increase in Family size multiplies the odds by 2.182, i.e. a 118.22% increase in the odds.
CCAvg: Holding all other features constant, a one-unit increase in CCAvg multiplies the odds by 1.473, i.e. a 47.34% increase in the odds.
Education: Holding all other features constant, a one-level increase in Education multiplies the odds by 5.702, i.e. a 470.20% increase in the odds.
Mortgage: Holding all other features constant, a one-unit increase in Mortgage multiplies the odds by 1.001, i.e. a 0.11% increase in the odds.
Securities_Account: Holding all other features constant, having a securities account multiplies the odds by 0.440, i.e. a 56.01% decrease in the odds.
CD_Account: Holding all other features constant, having a CD account multiplies the odds by 24.990, i.e. a 2399.01% increase in the odds.
Online: Holding all other features constant, using internet banking multiplies the odds by 0.567, i.e. a 43.33% decrease in the odds.
CreditCard: Holding all other features constant, having a credit card from another bank multiplies the odds by 0.382, i.e. a 61.80% decrease in the odds.

The explanations above cover the main columns; the County dummy variables can be interpreted in the same way.
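The odds interpretation above connects back to probabilities through the sigmoid: a customer's predicted probability is 1 / (1 + exp(−log-odds)). A small sketch with an illustrative log-odds value (not taken from the fitted model):

```python
import numpy as np

log_odds = 1.2  # hypothetical linear-combination score for one customer
prob = 1 / (1 + np.exp(-log_odds))  # sigmoid maps log-odds to probability
odds = prob / (1 - prob)            # and odds recover exp(log_odds)
print(round(prob, 4), round(odds, 4))  # probability ≈ 0.7685, odds ≈ 3.3201
```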
# Create confusion matrix with heatmap
def create_confusion_matrix(model, predictor, target, threshold=0.5):
    """
    Description:
    This is the function to create the confusion matrix and its heatmap
    Inputs:
    model: classifier
    predictor: independent variables
    target: dependent variable
    threshold: threshold for classifying an observation as class 1
    Outputs:
    Heatmap plot with confusion matrix values
    """
    # Do the prediction (use the model argument, not a global variable):
    prediction = (model.predict_proba(predictor)[:, 1] > threshold).astype(int)
    # Confusion matrix:
    logistic_cm = confusion_matrix(target, prediction, labels=[1, 0])
    df_cm = pd.DataFrame(
        logistic_cm,
        index=["Actual 1", "Actual 0"],
        columns=["Predict 1", "Predict 0"],
    )
    plt.figure(figsize=(7, 5))
    sns.heatmap(df_cm, annot=True, fmt="g")
    plt.show()
# Create a function to compute the model metrics:
def logistic_model_performance(model, predictor, target, threshold=0.5):
    """
    Description:
    This is the function to compute the model metrics
    Inputs:
    model: classifier
    predictor: independent variables
    target: dependent variable
    threshold: threshold for classifying an observation as class 1
    Outputs:
    Model metrics
    """
    # Do the prediction (use the model argument, not a global variable):
    prediction = (model.predict_proba(predictor)[:, 1] > threshold).astype(int)
    # Calculate the accuracy:
    accuracy = accuracy_score(target, prediction)
    # Calculate recall:
    recall = recall_score(target, prediction)
    # Calculate precision:
    precision = precision_score(target, prediction)
    # Calculate F1 score:
    f1 = f1_score(target, prediction)
    # Collect the metrics in a dataframe:
    metrics_dataframe = pd.DataFrame(
        {
            "Accuracy": accuracy,
            "Recall": recall,
            "Precision": precision,
            "F1": f1,
        },
        index=[0],
    )
    return metrics_dataframe
# Confusion matrix for train dataset:
create_confusion_matrix(logistic_regression, x_train, y_train)
Confusion matrix interpretation for training dataset
# Calculate the model metrics for train dataset:
logistic_train_metrics = logistic_model_performance(
logistic_regression, x_train, y_train
)
logistic_train_metrics
| Accuracy | Recall | Precision | F1 | |
|---|---|---|---|---|
| 0 | 0.960571 | 0.688822 | 0.86692 | 0.767677 |
Looking at the model metrics for the train dataset:
Overall, the model is good, but it still needs improvement.
# Confusion matrix for test dataset:
create_confusion_matrix(logistic_regression, x_test, y_test)
Confusion matrix interpretation for testing dataset
# Calculate the model metrics for test dataset:
logistic_test_metrics = logistic_model_performance(logistic_regression, x_test, y_test)
logistic_test_metrics
| Accuracy | Recall | Precision | F1 | |
|---|---|---|---|---|
| 0 | 0.946667 | 0.577181 | 0.834951 | 0.68254 |
Looking at the model metrics for the test dataset:
Overall, the model still needs improvement, since both the Recall and the F1 score are quite low.
logit_roc_auc_train = roc_auc_score(
    y_train, logistic_regression.predict_proba(x_train)[:, 1]
)
fpr, tpr, thresholds = roc_curve(
    y_train, logistic_regression.predict_proba(x_train)[:, 1]
)
plt.figure(figsize=(7, 5))
plt.plot(fpr, tpr, label="Logistic Regression (area = %0.2f)" % logit_roc_auc_train)
plt.plot([0, 1], [0, 1], "r--")
plt.xlim([0.0, 1.0])
plt.ylim([0.0, 1.05])
plt.xlabel("False Positive Rate")
plt.ylabel("True Positive Rate")
plt.title("Receiver operating characteristic")
plt.legend(loc="lower right")
plt.show()
logit_roc_auc_test = roc_auc_score(
    y_test, logistic_regression.predict_proba(x_test)[:, 1]
)
fpr, tpr, thresholds = roc_curve(
    y_test, logistic_regression.predict_proba(x_test)[:, 1]
)
plt.figure(figsize=(7, 5))
plt.plot(fpr, tpr, label="Logistic Regression (area = %0.2f)" % logit_roc_auc_test)
plt.plot([0, 1], [0, 1], "r--")
plt.xlim([0.0, 1.0])
plt.ylim([0.0, 1.05])
plt.xlabel("False Positive Rate")
plt.ylabel("True Positive Rate")
plt.title("Receiver operating characteristic")
plt.legend(loc="lower right")
plt.show()
fpr, tpr, thresholds = roc_curve(
    y_train, logistic_regression.predict_proba(x_train)[:, 1]
)
optimal_threshold_index = np.argmax(tpr - fpr)
optimal_threshold_auc_roc = thresholds[optimal_threshold_index]
print(optimal_threshold_auc_roc)
0.11717656818045581
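The cell above picks the threshold by Youden's J statistic: the point on the ROC curve that maximizes TPR − FPR. The same logic on a tiny hand-made example (the labels and scores are hypothetical):

```python
import numpy as np

y_true = np.array([0, 0, 0, 0, 1, 0, 1, 1, 1, 1])
scores = np.array([0.05, 0.1, 0.2, 0.3, 0.35, 0.4, 0.6, 0.7, 0.8, 0.9])

# sweep candidate thresholds from high to low and compute TPR/FPR at each
thresholds = np.sort(np.unique(scores))[::-1]
tpr = np.array([((scores >= t) & (y_true == 1)).sum() / (y_true == 1).sum() for t in thresholds])
fpr = np.array([((scores >= t) & (y_true == 0)).sum() / (y_true == 0).sum() for t in thresholds])

# Youden's J: the threshold where TPR - FPR peaks separates the classes best
best_threshold = thresholds[np.argmax(tpr - fpr)]
print(best_threshold)  # → 0.6
```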
# Confusion matrix for train dataset:
create_confusion_matrix(
    logistic_regression, x_train, y_train, threshold=optimal_threshold_auc_roc
)
# Calculate the model metrics for train dataset:
logistic_train_metrics_auc_roc = logistic_model_performance(
    logistic_regression, x_train, y_train, threshold=optimal_threshold_auc_roc
)
logistic_train_metrics_auc_roc
| Accuracy | Recall | Precision | F1 | |
|---|---|---|---|---|
| 0 | 0.911429 | 0.89426 | 0.518389 | 0.656319 |
# Confusion matrix for test dataset:
create_confusion_matrix(
    logistic_regression, x_test, y_test, threshold=optimal_threshold_auc_roc
)
# Calculate the model metrics for test dataset:
logistic_test_metrics_auc_roc = logistic_model_performance(
    logistic_regression, x_test, y_test, threshold=optimal_threshold_auc_roc
)
logistic_test_metrics_auc_roc
| Accuracy | Recall | Precision | F1 | |
|---|---|---|---|---|
| 0 | 0.908667 | 0.852349 | 0.524793 | 0.649616 |
y_scores = logistic_regression.predict_proba(x_train)[:, 1]
precision, recall, threshold = precision_recall_curve(
    y_train,
    y_scores,
)
def plot_prec_recall_vs_tresh(precisions, recalls, thresholds):
    plt.plot(thresholds, precisions[:-1], "b--", label="precision")
    plt.plot(thresholds, recalls[:-1], "g--", label="recall")
    plt.xlabel("Threshold")
    plt.legend(loc="upper left")
    plt.ylim([0, 1])
plt.figure(figsize=(10, 7))
plot_prec_recall_vs_tresh(precision, recall, threshold)
plt.show()
# setting the threshold
optimal_threshold_curve = 0.38
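The 0.38 value above was read off the plot; the same crossing point of the precision and recall curves can also be located programmatically as the threshold where the two are closest. A minimal numpy sketch on toy data (the labels and scores are hypothetical):

```python
import numpy as np

y_true = np.array([0, 0, 1, 0, 1, 1])
scores = np.array([0.1, 0.3, 0.35, 0.5, 0.6, 0.9])

thresholds = np.sort(np.unique(scores))
prec, rec = [], []
for t in thresholds:
    pred = scores >= t
    tp = (pred & (y_true == 1)).sum()
    prec.append(tp / max(pred.sum(), 1))
    rec.append(tp / (y_true == 1).sum())

# the crossing point is where |precision - recall| is smallest
crossing = thresholds[np.argmin(np.abs(np.array(prec) - np.array(rec)))]
print(crossing)  # → 0.5
```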
# Confusion matrix for train dataset:
create_confusion_matrix(
    logistic_regression, x_train, y_train, threshold=optimal_threshold_curve
)
# Calculate the model metrics for train dataset:
logistic_train_metrics_pre_recall = logistic_model_performance(
    logistic_regression, x_train, y_train, threshold=optimal_threshold_curve
)
logistic_train_metrics_pre_recall
| Accuracy | Recall | Precision | F1 | |
|---|---|---|---|---|
| 0 | 0.953714 | 0.722054 | 0.773463 | 0.746875 |
# Confusion matrix for test dataset:
create_confusion_matrix(
    logistic_regression, x_test, y_test, threshold=optimal_threshold_curve
)
# Calculate the model metrics for test dataset:
logistic_test_metrics_pre_recall = logistic_model_performance(
    logistic_regression, x_test, y_test, threshold=optimal_threshold_curve
)
logistic_test_metrics_pre_recall
| Accuracy | Recall | Precision | F1 | |
|---|---|---|---|---|
| 0 | 0.943333 | 0.657718 | 0.742424 | 0.697509 |
models_train_comp_df = pd.concat(
    [
        logistic_train_metrics.T,
        logistic_train_metrics_auc_roc.T,
        logistic_train_metrics_pre_recall.T,
    ],
    axis=1,
)
models_train_comp_df.columns = [
    "First Model - Logistic_Regression_Threshold_0.5",
    "Second Model - Logistic_Regression_Threshold_0.117",
    "Third Model - Logistic_Regression_Threshold_0.38",
]
print("Training performance comparison:")
models_train_comp_df
Training performance comparison:
| First Model - Logistic_Regression_Threshold_0.5 | Second Model - Logistic_Regression_Threshold_0.117 | Third Model - Logistic_Regression_Threshold_0.38 | |
|---|---|---|---|
| Accuracy | 0.960571 | 0.911429 | 0.953714 |
| Recall | 0.688822 | 0.894260 | 0.722054 |
| Precision | 0.866920 | 0.518389 | 0.773463 |
| F1 | 0.767677 | 0.656319 | 0.746875 |
models_test_comp_df = pd.concat(
    [
        logistic_test_metrics.T,
        logistic_test_metrics_auc_roc.T,
        logistic_test_metrics_pre_recall.T,
    ],
    axis=1,
)
models_test_comp_df.columns = [
    "First Model - Logistic_Regression_Threshold_0.5",
    "Second Model - Logistic_Regression_Threshold_0.117",
    "Third Model - Logistic_Regression_Threshold_0.38",
]
print("Testing performance comparison:")
models_test_comp_df
Testing performance comparison:
| First Model - Logistic_Regression_Threshold_0.5 | Second Model - Logistic_Regression_Threshold_0.117 | Third Model - Logistic_Regression_Threshold_0.38 | |
|---|---|---|---|
| Accuracy | 0.946667 | 0.908667 | 0.943333 |
| Recall | 0.577181 | 0.852349 | 0.657718 |
| Precision | 0.834951 | 0.524793 | 0.742424 |
| F1 | 0.682540 | 0.649616 | 0.697509 |
-> Hence, the F1 score is used to evaluate the model's performance, since it balances Precision and Recall.
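F1 is the harmonic mean of Precision and Recall, so it rewards models that keep the two in balance rather than trading one for the other. A quick check against the threshold-0.5 training numbers from the comparison table above:

```python
def f1(precision, recall):
    # harmonic mean: dominated by the smaller of the two values
    return 2 * precision * recall / (precision + recall)

print(round(f1(0.866920, 0.688822), 6))  # ≈ 0.767677, matching the reported train F1
```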
decision_tree_model = DecisionTreeClassifier(
    criterion="gini", class_weight={0: 0.15, 1: 0.85}, random_state=1
)
decision_tree_model.fit(x_train, y_train)
DecisionTreeClassifier(class_weight={0: 0.15, 1: 0.85}, random_state=1)
# Create confusion matrix function for the Decision Tree model
def DT_confusion_matrix(model, predictor, target):
    """
    Description:
    This is the function to create the confusion matrix and heatmap for the Decision Tree model
    Inputs:
    model: classifier
    predictor: independent variables
    target: dependent variable
    Outputs:
    Heatmap plot with confusion matrix values
    """
    prediction = model.predict(predictor)
    cm = confusion_matrix(target, prediction)
    # Annotate each cell with the count and its share of all observations:
    labels = np.asarray(
        [
            "{0:0.0f}".format(item) + "\n{0:.2%}".format(item / cm.flatten().sum())
            for item in cm.flatten()
        ]
    ).reshape(2, 2)
    plt.figure(figsize=(6, 4))
    sns.heatmap(cm, annot=labels, fmt="")
    plt.ylabel("True label")
    plt.xlabel("Predicted label")
# Create a function to compute the model metrics:
def DT_model_metrics(model, predictor, target):
    """
    Description:
    This is the function to compute the model metrics
    Inputs:
    model: classifier
    predictor: independent variables
    target: dependent variable
    Outputs:
    Model metrics
    """
    # Do the prediction:
    prediction = model.predict(predictor)
    # Calculate the accuracy:
    accuracy = accuracy_score(target, prediction)
    # Calculate recall:
    recall = recall_score(target, prediction)
    # Calculate precision:
    precision = precision_score(target, prediction)
    # Calculate F1 score:
    f1 = f1_score(target, prediction)
    # Collect the metrics in a dataframe:
    metrics_dataframe = pd.DataFrame(
        {
            "Accuracy": accuracy,
            "Recall": recall,
            "Precision": precision,
            "F1": f1,
        },
        index=[0],
    )
    return metrics_dataframe
# apply confusion matrix on train dataset:
DT_confusion_matrix(decision_tree_model, x_train, y_train)
# calculate DT model metrics on train dataset:
DT_train_metrics = DT_model_metrics(decision_tree_model, x_train, y_train)
DT_train_metrics
| Accuracy | Recall | Precision | F1 | |
|---|---|---|---|---|
| 0 | 1.0 | 1.0 | 1.0 | 1.0 |
# apply confusion matrix on test dataset:
DT_confusion_matrix(decision_tree_model, x_test, y_test)
# calculate DT model metrics on test dataset:
DT_test_metrics = DT_model_metrics(decision_tree_model, x_test, y_test)
DT_test_metrics
| Accuracy | Recall | Precision | F1 | |
|---|---|---|---|---|
| 0 | 0.973333 | 0.845638 | 0.881119 | 0.863014 |
feature_names = x_train.columns.tolist()
# plot the model
plt.figure(figsize=(20, 30))
output = tree.plot_tree(
    decision_tree_model,
    feature_names=feature_names,
    filled=True,
    fontsize=9,
    node_ids=False,
    class_names=None,
)
# the code below adds arrows to the decision tree splits if they are missing
for line in output:
    arrow = line.arrow_patch
    if arrow is not None:
        arrow.set_edgecolor("black")
        arrow.set_linewidth(1)
plt.show()
# Text report showing the rules of a decision tree -
print(
    tree.export_text(
        decision_tree_model, feature_names=feature_names, show_weights=True
    )
)
|--- Income <= 98.50 | |--- CCAvg <= 2.95 | | |--- weights: [374.10, 0.00] class: 0 | |--- CCAvg > 2.95 | | |--- CD_Account <= 0.50 | | | |--- CCAvg <= 3.95 | | | | |--- Income <= 81.50 | | | | | |--- Age <= 36.50 | | | | | | |--- Education <= 1.50 | | | | | | | |--- weights: [0.60, 0.00] class: 0 | | | | | | |--- Education > 1.50 | | | | | | | |--- CCAvg <= 3.50 | | | | | | | | |--- County_Santa Clara County <= 0.50 | | | | | | | | | |--- weights: [0.00, 0.85] class: 1 | | | | | | | | |--- County_Santa Clara County > 0.50 | | | | | | | | | |--- weights: [0.00, 0.85] class: 1 | | | | | | | |--- CCAvg > 3.50 | | | | | | | | |--- weights: [0.15, 0.00] class: 0 | | | | | |--- Age > 36.50 | | | | | | |--- County_Los Angeles County <= 0.50 | | | | | | | |--- weights: [5.25, 0.00] class: 0 | | | | | | |--- County_Los Angeles County > 0.50 | | | | | | | |--- Experience <= 20.00 | | | | | | | | |--- Experience <= 17.50 | | | | | | | | | |--- weights: [0.30, 0.00] class: 0 | | | | | | | | |--- Experience > 17.50 | | | | | | | | | |--- weights: [0.00, 0.85] class: 1 | | | | | | | |--- Experience > 20.00 | | | | | | | | |--- weights: [1.05, 0.00] class: 0 | | | | |--- Income > 81.50 | | | | | |--- Mortgage <= 152.00 | | | | | | |--- County_Los Angeles County <= 0.50 | | | | | | | |--- Securities_Account <= 0.50 | | | | | | | | |--- County_Yolo County <= 0.50 | | | | | | | | | |--- Family <= 3.50 | | | | | | | | | | |--- Income <= 84.00 | | | | | | | | | | | |--- weights: [0.00, 5.10] class: 1 | | | | | | | | | | |--- Income > 84.00 | | | | | | | | | | | |--- truncated branch of depth 4 | | | | | | | | | |--- Family > 3.50 | | | | | | | | | | |--- Education <= 2.50 | | | | | | | | | | | |--- weights: [0.60, 0.00] class: 0 | | | | | | | | | | |--- Education > 2.50 | | | | | | | | | | | |--- weights: [0.00, 0.85] class: 1 | | | | | | | | |--- County_Yolo County > 0.50 | | | | | | | | | |--- Experience <= 17.50 | | | | | | | | | | |--- weights: [0.15, 0.00] class: 0 | | | | | | | 
| | |--- Experience > 17.50 | | | | | | | | | | |--- weights: [0.15, 0.00] class: 0 | | | | | | | |--- Securities_Account > 0.50 | | | | | | | | |--- CreditCard <= 0.50 | | | | | | | | | |--- weights: [0.45, 0.00] class: 0 | | | | | | | | |--- CreditCard > 0.50 | | | | | | | | | |--- weights: [0.15, 0.00] class: 0 | | | | | | |--- County_Los Angeles County > 0.50 | | | | | | | |--- CCAvg <= 3.15 | | | | | | | | |--- weights: [0.30, 0.00] class: 0 | | | | | | | |--- CCAvg > 3.15 | | | | | | | | |--- weights: [0.45, 0.00] class: 0 | | | | | |--- Mortgage > 152.00 | | | | | | |--- CCAvg <= 3.05 | | | | | | | |--- weights: [0.15, 0.00] class: 0 | | | | | | |--- CCAvg > 3.05 | | | | | | | |--- weights: [0.90, 0.00] class: 0 | | | |--- CCAvg > 3.95 | | | | |--- weights: [6.75, 0.00] class: 0 | | |--- CD_Account > 0.50 | | | |--- CCAvg <= 4.50 | | | | |--- weights: [0.00, 6.80] class: 1 | | | |--- CCAvg > 4.50 | | | | |--- weights: [0.15, 0.00] class: 0 |--- Income > 98.50 | |--- Education <= 1.50 | | |--- Family <= 2.50 | | | |--- Income <= 100.00 | | | | |--- CCAvg <= 4.20 | | | | | |--- weights: [0.45, 0.00] class: 0 | | | | |--- CCAvg > 4.20 | | | | | |--- Age <= 54.50 | | | | | | |--- weights: [0.00, 0.85] class: 1 | | | | | |--- Age > 54.50 | | | | | | |--- weights: [0.00, 0.85] class: 1 | | | |--- Income > 100.00 | | | | |--- Income <= 103.50 | | | | | |--- Securities_Account <= 0.50 | | | | | | |--- weights: [2.10, 0.00] class: 0 | | | | | |--- Securities_Account > 0.50 | | | | | | |--- County_Los Angeles County <= 0.50 | | | | | | | |--- weights: [0.15, 0.00] class: 0 | | | | | | |--- County_Los Angeles County > 0.50 | | | | | | | |--- weights: [0.00, 0.85] class: 1 | | | | |--- Income > 103.50 | | | | | |--- Income <= 104.50 | | | | | | |--- weights: [0.30, 0.00] class: 0 | | | | | |--- Income > 104.50 | | | | | | |--- weights: [64.65, 0.00] class: 0 | | |--- Family > 2.50 | | | |--- Income <= 108.50 | | | | |--- Experience <= 3.50 | | | | | |--- weights: [0.00, 
0.85] class: 1 | | | | |--- Experience > 3.50 | | | | | |--- weights: [1.20, 0.00] class: 0 | | | |--- Income > 108.50 | | | | |--- Age <= 26.00 | | | | | |--- weights: [0.15, 0.00] class: 0 | | | | |--- Age > 26.00 | | | | | |--- Income <= 113.50 | | | | | | |--- Experience <= 31.50 | | | | | | | |--- Income <= 112.00 | | | | | | | | |--- Experience <= 13.00 | | | | | | | | | |--- weights: [0.00, 0.85] class: 1 | | | | | | | | |--- Experience > 13.00 | | | | | | | | | |--- weights: [0.00, 2.55] class: 1 | | | | | | | |--- Income > 112.00 | | | | | | | | |--- weights: [0.15, 0.00] class: 0 | | | | | | |--- Experience > 31.50 | | | | | | | |--- weights: [0.15, 0.00] class: 0 | | | | | |--- Income > 113.50 | | | | | | |--- weights: [0.00, 41.65] class: 1 | |--- Education > 1.50 | | |--- Income <= 116.50 | | | |--- CCAvg <= 2.80 | | | | |--- Income <= 106.50 | | | | | |--- weights: [5.40, 0.00] class: 0 | | | | |--- Income > 106.50 | | | | | |--- Experience <= 31.50 | | | | | | |--- Age <= 41.50 | | | | | | | |--- County_San Diego County <= 0.50 | | | | | | | | |--- CCAvg <= 1.75 | | | | | | | | | |--- CCAvg <= 1.55 | | | | | | | | | | |--- weights: [1.35, 0.00] class: 0 | | | | | | | | | |--- CCAvg > 1.55 | | | | | | | | | | |--- CCAvg <= 1.65 | | | | | | | | | | | |--- weights: [0.00, 0.85] class: 1 | | | | | | | | | | |--- CCAvg > 1.65 | | | | | | | | | | | |--- weights: [0.00, 0.85] class: 1 | | | | | | | | |--- CCAvg > 1.75 | | | | | | | | | |--- Age <= 26.00 | | | | | | | | | | |--- weights: [0.45, 0.00] class: 0 | | | | | | | | | |--- Age > 26.00 | | | | | | | | | | |--- weights: [2.25, 0.00] class: 0 | | | | | | | |--- County_San Diego County > 0.50 | | | | | | | | |--- weights: [0.00, 0.85] class: 1 | | | | | | |--- Age > 41.50 | | | | | | | |--- Online <= 0.50 | | | | | | | | |--- weights: [0.45, 0.00] class: 0 | | | | | | | |--- Online > 0.50 | | | | | | | | |--- County_Contra Costa County <= 0.50 | | | | | | | | | |--- County_Yolo County <= 0.50 | | | | | 
| | | | | |--- County_Santa Clara County <= 0.50 | | | | | | | | | | | |--- truncated branch of depth 2 | | | | | | | | | | |--- County_Santa Clara County > 0.50 | | | | | | | | | | | |--- weights: [0.15, 0.00] class: 0 | | | | | | | | | |--- County_Yolo County > 0.50 | | | | | | | | | | |--- weights: [0.15, 0.00] class: 0 | | | | | | | | |--- County_Contra Costa County > 0.50 | | | | | | | | | |--- weights: [0.15, 0.00] class: 0 | | | | | |--- Experience > 31.50 | | | | | | |--- weights: [1.50, 0.00] class: 0 | | | |--- CCAvg > 2.80 | | | | |--- Age <= 63.50 | | | | | |--- County_Santa Barbara County <= 0.50 | | | | | | |--- County_Yolo County <= 0.50 | | | | | | | |--- Family <= 1.50 | | | | | | | | |--- Online <= 0.50 | | | | | | | | | |--- weights: [0.00, 1.70] class: 1 | | | | | | | | |--- Online > 0.50 | | | | | | | | | |--- weights: [0.45, 0.00] class: 0 | | | | | | | |--- Family > 1.50 | | | | | | | | |--- Income <= 99.50 | | | | | | | | | |--- Mortgage <= 250.75 | | | | | | | | | | |--- weights: [0.30, 0.00] class: 0 | | | | | | | | | |--- Mortgage > 250.75 | | | | | | | | | | |--- weights: [0.00, 0.85] class: 1 | | | | | | | | |--- Income > 99.50 | | | | | | | | | |--- Experience <= 35.50 | | | | | | | | | | |--- County_Alameda County <= 0.50 | | | | | | | | | | | |--- weights: [0.00, 12.75] class: 1 | | | | | | | | | | |--- County_Alameda County > 0.50 | | | | | | | | | | | |--- weights: [0.00, 3.40] class: 1 | | | | | | | | | |--- Experience > 35.50 | | | | | | | | | | |--- Mortgage <= 126.25 | | | | | | | | | | | |--- weights: [0.15, 0.00] class: 0 | | | | | | | | | | |--- Mortgage > 126.25 | | | | | | | | | | | |--- weights: [0.00, 0.85] class: 1 | | | | | | |--- County_Yolo County > 0.50 | | | | | | | |--- weights: [0.15, 0.00] class: 0 | | | | | |--- County_Santa Barbara County > 0.50 | | | | | | |--- weights: [0.15, 0.00] class: 0 | | | | |--- Age > 63.50 | | | | | |--- weights: [0.30, 0.00] class: 0 | | |--- Income > 116.50 | | | |--- 
Securities_Account <= 0.50 | | | | |--- weights: [0.00, 165.75] class: 1 | | | |--- Securities_Account > 0.50 | | | | |--- weights: [0.00, 22.95] class: 1
importances = decision_tree_model.feature_importances_
indices = np.argsort(importances)
plt.figure(figsize=(12, 12))
plt.title("Feature Importances")
plt.barh(range(len(indices)), importances[indices], color="violet", align="center")
plt.yticks(range(len(indices)), [feature_names[i] for i in indices])
plt.xlabel("Relative Importance")
plt.show()
# Choose the type of classifier.
estimator = DecisionTreeClassifier(random_state=1, class_weight={0: 0.15, 1: 0.85})
# Grid of parameters to choose from
parameters = {
    "max_depth": [5, 10, 15, None],
    "criterion": ["entropy", "gini"],
    "splitter": ["best", "random"],
    "min_impurity_decrease": [0.00001, 0.0001, 0.01],
}
# Type of scoring used to compare parameter combinations
from sklearn.metrics import make_scorer, recall_score
scorer = make_scorer(recall_score)
# Run the grid search
grid_obj = GridSearchCV(estimator, parameters, scoring=scorer, cv=5)
grid_obj = grid_obj.fit(x_train, y_train)
# Set the clf to the best combination of parameters
estimator = grid_obj.best_estimator_
# Fit the best algorithm to the data.
estimator.fit(x_train, y_train)
DecisionTreeClassifier(class_weight={0: 0.15, 1: 0.85}, max_depth=5,
min_impurity_decrease=1e-05, random_state=1)
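As a side note, `GridSearchCV` keeps a record of every trial, not just the winner. A minimal, self-contained sketch (synthetic data and a reduced grid, not the project's training set) of inspecting what the search found:

```python
# Illustrative sketch: inspect a GridSearchCV result on synthetic data.
# The parameter grid mirrors (a subset of) the one used above.
from sklearn.datasets import make_classification
from sklearn.metrics import make_scorer, recall_score
from sklearn.model_selection import GridSearchCV
from sklearn.tree import DecisionTreeClassifier

# imbalanced synthetic problem: ~85% class 0, ~15% class 1
X, y = make_classification(n_samples=300, weights=[0.85], random_state=1)

grid = GridSearchCV(
    DecisionTreeClassifier(random_state=1, class_weight={0: 0.15, 1: 0.85}),
    {"max_depth": [3, 5, None], "criterion": ["entropy", "gini"]},
    scoring=make_scorer(recall_score),
    cv=5,
)
grid.fit(X, y)

# best_params_ shows the winning combination; best_score_ is its mean CV recall;
# cv_results_ holds the full per-combination record
print(grid.best_params_)
print(grid.best_score_)
```

`grid.cv_results_` can be loaded into a DataFrame to see how sensitive recall is to each parameter, which helps decide whether the grid needs refining.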
# apply confusion matrix on train dataset:
DT_confusion_matrix(estimator, x_train, y_train)
# calculate DT model metrics on train dataset:
DT_train_metrics_gridsearch = DT_model_metrics(estimator, x_train, y_train)
DT_train_metrics_gridsearch
| | Accuracy | Recall | Precision | F1 |
|---|---|---|---|---|
| 0 | 0.985714 | 0.966767 | 0.891365 | 0.927536 |
# apply confusion matrix on test dataset:
DT_confusion_matrix(estimator, x_test, y_test)
# calculate DT model metrics on test dataset:
DT_test_metrics_gridsearch = DT_model_metrics(estimator, x_test, y_test)
DT_test_metrics_gridsearch
| | Accuracy | Recall | Precision | F1 |
|---|---|---|---|---|
| 0 | 0.972 | 0.892617 | 0.836478 | 0.863636 |
plt.figure(figsize=(15, 10))
tree.plot_tree(
    estimator,
    feature_names=feature_names,
    filled=True,
    fontsize=9,
    node_ids=True,
    class_names=True,
)
plt.show()
# Text report showing the rules of a decision tree -
print(tree.export_text(estimator, feature_names=feature_names, show_weights=True))
|--- Income <= 98.50
|   |--- CCAvg <= 2.95
|   |   |--- weights: [374.10, 0.00] class: 0
|   |--- CCAvg > 2.95
|   |   |--- CD_Account <= 0.50
|   |   |   |--- CCAvg <= 3.95
|   |   |   |   |--- Income <= 81.50
|   |   |   |   |   |--- weights: [7.35, 2.55] class: 0
|   |   |   |   |--- Income > 81.50
|   |   |   |   |   |--- weights: [4.35, 9.35] class: 1
|   |   |   |--- CCAvg > 3.95
|   |   |   |   |--- weights: [6.75, 0.00] class: 0
|   |   |--- CD_Account > 0.50
|   |   |   |--- CCAvg <= 4.50
|   |   |   |   |--- weights: [0.00, 6.80] class: 1
|   |   |   |--- CCAvg > 4.50
|   |   |   |   |--- weights: [0.15, 0.00] class: 0
|--- Income > 98.50
|   |--- Education <= 1.50
|   |   |--- Family <= 2.50
|   |   |   |--- Income <= 100.00
|   |   |   |   |--- CCAvg <= 4.20
|   |   |   |   |   |--- weights: [0.45, 0.00] class: 0
|   |   |   |   |--- CCAvg > 4.20
|   |   |   |   |   |--- weights: [0.00, 1.70] class: 1
|   |   |   |--- Income > 100.00
|   |   |   |   |--- Income <= 103.50
|   |   |   |   |   |--- weights: [2.25, 0.85] class: 0
|   |   |   |   |--- Income > 103.50
|   |   |   |   |   |--- weights: [64.95, 0.00] class: 0
|   |   |--- Family > 2.50
|   |   |   |--- Income <= 108.50
|   |   |   |   |--- Experience <= 3.50
|   |   |   |   |   |--- weights: [0.00, 0.85] class: 1
|   |   |   |   |--- Experience > 3.50
|   |   |   |   |   |--- weights: [1.20, 0.00] class: 0
|   |   |   |--- Income > 108.50
|   |   |   |   |--- Age <= 26.00
|   |   |   |   |   |--- weights: [0.15, 0.00] class: 0
|   |   |   |   |--- Age > 26.00
|   |   |   |   |   |--- weights: [0.30, 45.05] class: 1
|   |--- Education > 1.50
|   |   |--- Income <= 116.50
|   |   |   |--- CCAvg <= 2.80
|   |   |   |   |--- Income <= 106.50
|   |   |   |   |   |--- weights: [5.40, 0.00] class: 0
|   |   |   |   |--- Income > 106.50
|   |   |   |   |   |--- weights: [6.45, 5.95] class: 0
|   |   |   |--- CCAvg > 2.80
|   |   |   |   |--- Age <= 63.50
|   |   |   |   |   |--- weights: [1.20, 19.55] class: 1
|   |   |   |   |--- Age > 63.50
|   |   |   |   |   |--- weights: [0.30, 0.00] class: 0
|   |   |--- Income > 116.50
|   |   |   |--- weights: [0.00, 188.70] class: 1
Observation
Interpretations of the other decision rules can be made in a similar way.
importances = estimator.feature_importances_
indices = np.argsort(importances)
plt.figure(figsize=(12, 12))
plt.title("Feature Importances")
plt.barh(range(len(indices)), importances[indices], color="violet", align="center")
plt.yticks(range(len(indices)), [feature_names[i] for i in indices])
plt.xlabel("Relative Importance")
plt.show()
clf = DecisionTreeClassifier(random_state=1)
path = clf.cost_complexity_pruning_path(x_train, y_train)
ccp_alphas, impurities = path.ccp_alphas, path.impurities
pd.DataFrame(path)
| | ccp_alphas | impurities |
|---|---|---|
| 0 | 0.000000 | 0.000000 |
| 1 | 0.000186 | 0.000559 |
| 2 | 0.000187 | 0.001121 |
| 3 | 0.000269 | 0.002195 |
| 4 | 0.000270 | 0.002735 |
| 5 | 0.000273 | 0.004371 |
| 6 | 0.000359 | 0.005447 |
| 7 | 0.000381 | 0.005828 |
| 8 | 0.000381 | 0.006209 |
| 9 | 0.000381 | 0.006590 |
| 10 | 0.000381 | 0.006971 |
| 11 | 0.000408 | 0.007787 |
| 12 | 0.000476 | 0.008263 |
| 13 | 0.000514 | 0.009805 |
| 14 | 0.000527 | 0.010332 |
| 15 | 0.000578 | 0.012646 |
| 16 | 0.000582 | 0.013228 |
| 17 | 0.000607 | 0.013835 |
| 18 | 0.000621 | 0.014456 |
| 19 | 0.000882 | 0.017985 |
| 20 | 0.001552 | 0.019536 |
| 21 | 0.002333 | 0.021869 |
| 22 | 0.003024 | 0.024893 |
| 23 | 0.003294 | 0.028187 |
| 24 | 0.006473 | 0.034659 |
| 25 | 0.023866 | 0.058525 |
| 26 | 0.056365 | 0.171255 |
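The mechanics of the pruning path are easier to see on a toy problem. Below is an illustrative sketch on synthetic data (not the campaign data): as `ccp_alpha` grows, the refitted tree shrinks until only the root remains, which is exactly the pattern in the table above.

```python
# Illustrative sketch: cost-complexity pruning path on synthetic data.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=400, random_state=1)

# effective alphas at which subtrees get pruned away, in increasing order
path = DecisionTreeClassifier(random_state=1).cost_complexity_pruning_path(X, y)

node_counts = []
for alpha in path.ccp_alphas:
    clf = DecisionTreeClassifier(random_state=1, ccp_alpha=alpha).fit(X, y)
    node_counts.append(clf.tree_.node_count)

# node counts shrink as alpha grows; the last alpha prunes down to a single node
print(list(zip(np.round(path.ccp_alphas, 4), node_counts)))
```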
fig, ax = plt.subplots(figsize=(10, 5))
ax.plot(ccp_alphas[:-1], impurities[:-1], marker="o", drawstyle="steps-post")
ax.set_xlabel("effective alpha")
ax.set_ylabel("total impurity of leaves")
ax.set_title("Total Impurity vs effective alpha for training set")
plt.show()
clfs = []
for ccp_alpha in ccp_alphas:
    clf = DecisionTreeClassifier(random_state=1, ccp_alpha=ccp_alpha)
    clf.fit(x_train, y_train)
    clfs.append(clf)
print(
    "Number of nodes in the last tree is: {} with ccp_alpha: {}".format(
        clfs[-1].tree_.node_count, ccp_alphas[-1]
    )
)
Number of nodes in the last tree is: 1 with ccp_alpha: 0.056364969335601575
clfs
[DecisionTreeClassifier(random_state=1), DecisionTreeClassifier(ccp_alpha=0.00018633540372670792, random_state=1), DecisionTreeClassifier(ccp_alpha=0.00018719211822660126, random_state=1), DecisionTreeClassifier(ccp_alpha=0.00026869682042095835, random_state=1), DecisionTreeClassifier(ccp_alpha=0.0002699474438604879, random_state=1), DecisionTreeClassifier(ccp_alpha=0.0002726902726902725, random_state=1), DecisionTreeClassifier(ccp_alpha=0.00035854341736694684, random_state=1), DecisionTreeClassifier(ccp_alpha=0.0003809523809523809, random_state=1), DecisionTreeClassifier(ccp_alpha=0.0003809523809523809, random_state=1), DecisionTreeClassifier(ccp_alpha=0.0003809523809523809, random_state=1), DecisionTreeClassifier(ccp_alpha=0.0003809523809523809, random_state=1), DecisionTreeClassifier(ccp_alpha=0.00040816326530612246, random_state=1), DecisionTreeClassifier(ccp_alpha=0.0004761904761904762, random_state=1), DecisionTreeClassifier(ccp_alpha=0.000513818320269933, random_state=1), DecisionTreeClassifier(ccp_alpha=0.0005274725274725272, random_state=1), DecisionTreeClassifier(ccp_alpha=0.0005784151498437213, random_state=1), DecisionTreeClassifier(ccp_alpha=0.0005820274055568174, random_state=1), DecisionTreeClassifier(ccp_alpha=0.0006068918804504296, random_state=1), DecisionTreeClassifier(ccp_alpha=0.0006209286209286216, random_state=1), DecisionTreeClassifier(ccp_alpha=0.0008822052285337754, random_state=1), DecisionTreeClassifier(ccp_alpha=0.0015516446679237376, random_state=1), DecisionTreeClassifier(ccp_alpha=0.002333060640147255, random_state=1), DecisionTreeClassifier(ccp_alpha=0.003023521760901227, random_state=1), DecisionTreeClassifier(ccp_alpha=0.003293801935470495, random_state=1), DecisionTreeClassifier(ccp_alpha=0.006472814718223811, random_state=1), DecisionTreeClassifier(ccp_alpha=0.02386594448205822, random_state=1), DecisionTreeClassifier(ccp_alpha=0.056364969335601575, random_state=1)]
clfs = clfs[:-1]
ccp_alphas = ccp_alphas[:-1]
node_counts = [clf.tree_.node_count for clf in clfs]
depth = [clf.tree_.max_depth for clf in clfs]
fig, ax = plt.subplots(2, 1, figsize=(10, 7))
ax[0].plot(ccp_alphas, node_counts, marker="o", drawstyle="steps-post")
ax[0].set_xlabel("alpha")
ax[0].set_ylabel("number of nodes")
ax[0].set_title("Number of nodes vs alpha")
ax[1].plot(ccp_alphas, depth, marker="o", drawstyle="steps-post")
ax[1].set_xlabel("alpha")
ax[1].set_ylabel("depth of tree")
ax[1].set_title("Depth vs alpha")
fig.tight_layout()
recall_train = []
for clf in clfs:
    pred_train3 = clf.predict(x_train)
    values_train = recall_score(y_train, pred_train3)
    recall_train.append(values_train)
recall_test = []
for clf in clfs:
    pred_test = clf.predict(x_test)
    values_test = recall_score(y_test, pred_test)
    recall_test.append(values_test)
fig, ax = plt.subplots(figsize=(15, 5))
ax.set_xlabel("alpha")
ax.set_ylabel("Recall")
ax.set_title("Recall vs alpha for training and testing sets")
ax.plot(ccp_alphas, recall_train, marker="o", label="train", drawstyle="steps-post")
ax.plot(ccp_alphas, recall_test, marker="o", label="test", drawstyle="steps-post")
ax.legend()
plt.show()
# select the tree with the highest test recall
index_best_model = np.argmax(recall_test)
best_model = clfs[index_best_model]
print(best_model)
DecisionTreeClassifier(ccp_alpha=0.0006209286209286216, random_state=1)
# calculate DT model metrics on train dataset:
post_pruning_train_metrics = DT_model_metrics(best_model, x_train, y_train)
post_pruning_train_metrics
| | Accuracy | Recall | Precision | F1 |
|---|---|---|---|---|
| 0 | 0.991429 | 0.945619 | 0.963077 | 0.954268 |
# apply confusion matrix on train dataset:
DT_confusion_matrix(best_model, x_train, y_train)
# calculate DT model metrics on test dataset:
post_pruning_test_metrics = DT_model_metrics(best_model, x_test, y_test)
post_pruning_test_metrics
| | Accuracy | Recall | Precision | F1 |
|---|---|---|---|---|
| 0 | 0.982667 | 0.892617 | 0.93007 | 0.910959 |
plt.figure(figsize=(10, 10))
out = tree.plot_tree(
    best_model,
    feature_names=feature_names,
    filled=True,
    fontsize=9,
    node_ids=True,
    class_names=True,
)
for o in out:
    arrow = o.arrow_patch
    if arrow is not None:
        arrow.set_edgecolor("black")
        arrow.set_linewidth(1)
plt.show()
# Text report showing the rules of a decision tree -
print(tree.export_text(best_model, feature_names=feature_names, show_weights=True))
|--- Income <= 116.50
|   |--- CCAvg <= 2.95
|   |   |--- Income <= 106.50
|   |   |   |--- weights: [2553.00, 0.00] class: 0
|   |   |--- Income > 106.50
|   |   |   |--- Family <= 3.50
|   |   |   |   |--- weights: [63.00, 3.00] class: 0
|   |   |   |--- Family > 3.50
|   |   |   |   |--- Age <= 32.50
|   |   |   |   |   |--- weights: [12.00, 1.00] class: 0
|   |   |   |   |--- Age > 32.50
|   |   |   |   |   |--- Age <= 60.00
|   |   |   |   |   |   |--- weights: [0.00, 6.00] class: 1
|   |   |   |   |   |--- Age > 60.00
|   |   |   |   |   |   |--- weights: [4.00, 0.00] class: 0
|   |--- CCAvg > 2.95
|   |   |--- Income <= 92.50
|   |   |   |--- CD_Account <= 0.50
|   |   |   |   |--- weights: [117.00, 10.00] class: 0
|   |   |   |--- CD_Account > 0.50
|   |   |   |   |--- weights: [0.00, 5.00] class: 1
|   |   |--- Income > 92.50
|   |   |   |--- Education <= 1.50
|   |   |   |   |--- CD_Account <= 0.50
|   |   |   |   |   |--- weights: [33.00, 4.00] class: 0
|   |   |   |   |--- CD_Account > 0.50
|   |   |   |   |   |--- weights: [1.00, 5.00] class: 1
|   |   |   |--- Education > 1.50
|   |   |   |   |--- weights: [11.00, 28.00] class: 1
|--- Income > 116.50
|   |--- Education <= 1.50
|   |   |--- Family <= 2.50
|   |   |   |--- weights: [375.00, 0.00] class: 0
|   |   |--- Family > 2.50
|   |   |   |--- weights: [0.00, 47.00] class: 1
|   |--- Education > 1.50
|   |   |--- weights: [0.00, 222.00] class: 1
# Feature importances used in building the tree (also known as Gini importance)
print(
    pd.DataFrame(
        best_model.feature_importances_, columns=["Imp"], index=x_train.columns
    ).sort_values(by="Imp", ascending=False)
)
                                    Imp
Education                      0.437917
Income                         0.325272
Family                         0.156373
CCAvg                          0.041281
CD_Account                     0.024775
Age                            0.014382
County_Santa Barbara County    0.000000
County_Riverside County        0.000000
County_Sacramento County       0.000000
County_San Benito County       0.000000
County_San Bernardino County   0.000000
County_San Diego County        0.000000
County_San Francisco County    0.000000
County_San Joaquin County      0.000000
County_San Luis Obispo County  0.000000
County_San Mateo County        0.000000
County_Shasta County           0.000000
County_Santa Clara County      0.000000
County_Santa Cruz County       0.000000
County_Orange County           0.000000
County_Siskiyou County         0.000000
County_Solano County           0.000000
County_Sonoma County           0.000000
County_Stanislaus County       0.000000
County_Trinity County          0.000000
County_Tuolumne County         0.000000
County_Ventura County          0.000000
County_Placer County           0.000000
County_Merced County           0.000000
County_Napa County             0.000000
County_Contra Costa County     0.000000
Mortgage                       0.000000
Securities_Account             0.000000
Online                         0.000000
CreditCard                     0.000000
County_92717                   0.000000
County_93077                   0.000000
County_96651                   0.000000
County_Alameda County          0.000000
County_Butte County            0.000000
County_El Dorado County        0.000000
County_Monterey County         0.000000
County_Fresno County           0.000000
County_Humboldt County         0.000000
County_Imperial County         0.000000
County_Kern County             0.000000
County_Lake County             0.000000
County_Los Angeles County      0.000000
County_Marin County            0.000000
County_Mendocino County        0.000000
Experience                     0.000000
County_Yolo County             0.000000
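Since most of the county dummies get zero importance, it can be handy to list only the features the tree actually split on. A minimal sketch on synthetic data, with hypothetical feature names f0..f19:

```python
# Illustrative sketch: filter a fitted tree's importances down to the
# features it actually used (nonzero Gini importance), sorted descending.
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=300, random_state=1)
clf = DecisionTreeClassifier(max_depth=3, random_state=1).fit(X, y)

# importances are normalized, so they sum to 1 across all features
imp = pd.Series(clf.feature_importances_, index=[f"f{i}" for i in range(X.shape[1])])
used = imp[imp > 0].sort_values(ascending=False)
print(used)  # only the handful of features the tree split on
```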
importances = best_model.feature_importances_
indices = np.argsort(importances)
plt.figure(figsize=(12, 12))
plt.title("Feature Importances")
plt.barh(range(len(indices)), importances[indices], color="violet", align="center")
plt.yticks(range(len(indices)), [feature_names[i] for i in indices])
plt.xlabel("Relative Importance")
plt.show()
models_train_comp_df = pd.concat(
    [
        logistic_train_metrics.T,
        logistic_train_metrics_auc_roc.T,
        logistic_train_metrics_pre_recall.T,
        DT_train_metrics.T,
        DT_train_metrics_gridsearch.T,
        post_pruning_train_metrics.T,
    ],
    axis=1,
)
models_train_comp_df.columns = [
    "Logistic_Regression_Threshold_0.5",
    "Logistic_Regression_Threshold_0.117",
    "Logistic_Regression_Threshold_0.38",
    "Decision Tree",
    "Pre-Pruning Decision Tree",
    "Post-Pruning Decision Tree",
]
print("Training performance comparison:")
models_train_comp_df
Training performance comparison:
| | Logistic_Regression_Threshold_0.5 | Logistic_Regression_Threshold_0.117 | Logistic_Regression_Threshold_0.38 | Decision Tree | Pre-Pruning Decision Tree | Post-Pruning Decision Tree |
|---|---|---|---|---|---|---|
| Accuracy | 0.960571 | 0.911429 | 0.953714 | 1.0 | 0.985714 | 0.991429 |
| Recall | 0.688822 | 0.894260 | 0.722054 | 1.0 | 0.966767 | 0.945619 |
| Precision | 0.866920 | 0.518389 | 0.773463 | 1.0 | 0.891365 | 0.963077 |
| F1 | 0.767677 | 0.656319 | 0.746875 | 1.0 | 0.927536 | 0.954268 |
models_test_comp_df = pd.concat(
    [
        logistic_test_metrics.T,
        logistic_test_metrics_auc_roc.T,
        logistic_test_metrics_pre_recall.T,
        DT_test_metrics.T,
        DT_test_metrics_gridsearch.T,
        post_pruning_test_metrics.T,
    ],
    axis=1,
)
models_test_comp_df.columns = [
    "Logistic_Regression_Threshold_0.5",
    "Logistic_Regression_Threshold_0.117",
    "Logistic_Regression_Threshold_0.38",
    "Decision Tree",
    "Pre-Pruning Decision Tree",
    "Post-Pruning Decision Tree",
]
print("Testing performance comparison:")
models_test_comp_df
Testing performance comparison:
| | Logistic_Regression_Threshold_0.5 | Logistic_Regression_Threshold_0.117 | Logistic_Regression_Threshold_0.38 | Decision Tree | Pre-Pruning Decision Tree | Post-Pruning Decision Tree |
|---|---|---|---|---|---|---|
| Accuracy | 0.946667 | 0.908667 | 0.943333 | 0.973333 | 0.972000 | 0.982667 |
| Recall | 0.577181 | 0.852349 | 0.657718 | 0.845638 | 0.892617 | 0.892617 |
| Precision | 0.834951 | 0.524793 | 0.742424 | 0.881119 | 0.836478 | 0.930070 |
| F1 | 0.682540 | 0.649616 | 0.697509 | 0.863014 | 0.863636 | 0.910959 |
Observation
Education, Income, Family, CCAvg, CD_Account, and Age are the most important factors in predicting whether a customer will accept a personal loan.
If a customer's income is above 116 thousand dollars, there are two scenarios in which the loan is accepted: customers with graduate or advanced education are likely to accept, and so are undergraduate customers with a family of more than two members.
If a customer's income ranges from 106 to 116 thousand dollars and average credit card spending is below 3 thousand dollars per month, the marketing team should target families with more than three members and adults aged roughly 32 to 60; these customers are likely to accept the personal loan.
If average credit card spending is above 3 thousand dollars per month, income is below 92 thousand dollars, and the customer holds a certificate of deposit (CD) account, the customer is likely to accept the personal loan.
If average credit card spending is above 3 thousand dollars per month and income is above 92 thousand dollars, customers with only undergraduate education are likely to accept when they hold a CD account, while customers with higher education are likely to accept even without one. The marketing team should focus on these customer segments.
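The decision rules above can be sketched as a plain Python function. This is an illustrative translation of the pruned tree's main "class 1" branches (thresholds copied from the text export), not the model itself, and it ignores the residual impurity in mixed leaves:

```python
def likely_to_accept(income, ccavg, education, family, age, cd_account):
    """Illustrative translation of the pruned tree's acceptance rules.

    income in thousand dollars; ccavg = monthly credit card spend (thousands);
    education: 1 undergrad, 2 graduate, 3 advanced; cd_account: 0 or 1.
    """
    if income > 116.5:
        # two scenarios: graduate-or-higher education, or undergrad with family > 2
        return education > 1.5 or family > 2.5
    if ccavg <= 2.95:
        # mid income, low card spend: families of more than 3, aged ~32-60
        return income > 106.5 and family > 3.5 and 32.5 < age <= 60
    if income <= 92.5:
        # high card spend, lower income: CD account holders accept
        return cd_account == 1
    # income 92.5-116.5 with high card spend
    return education > 1.5 or cd_account == 1

print(likely_to_accept(150, 1.0, 3, 1, 45, 0))  # → True
```

Encoding the rules this way makes it easy for the marketing team to score a prospect list without rerunning the model, at the cost of freezing the thresholds.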